A METHOD OF MANAGING THE DECODING AND PLAYBACK OF A SOUND
SIGNAL IN AN ASYNCHRONOUS TRANSMISSION SYSTEM 

The present invention relates to a method of 
managing asynchrony in audio transmission.

GENERAL DESCRIPTION OF THE FIELD OF THE INVENTION

In general, the invention relates to transmission 
systems using low data rate speech encoders, in which the 
signals do not carry the reference clock of the source 
encoding system (the sampling frequency of the encoder).
This applies, for example, to Internet protocol (IP) type
transmissions or indeed to discontinuous transmissions,
etc.

A general aim of the invention is to resolve the
problems encountered by such systems in producing a
continuous stream of decoded speech or sound. 

Traditionally, telephone communications and sound 
channel networks have used analog transmission systems 
with frequency division multiplexing (primary groups,
amplitude and frequency modulation). Under such
conditions, the speech signal (or music signal; the term
"speech" is used generically throughout this document)
is converted into an electrical signal by
a microphone and it is this analog signal which is
filtered and modulated in order to be presented to a
receiver which amplifies it prior to presenting it to a
playback system (earphone, loudspeaker, etc.). 

Over the last few years, digital transmission and 
switching techniques have been progressively replacing
analog techniques. In pulse code modulation (PCM)
systems, the speech signal is sampled and converted into
a digital signal using an analog-to-digital converter
(ADC) driven at a fixed sampling frequency derived from a
master clock delivered by the network and also known to
the receiver system. This applies to analog and digital
subscriber connection units in telecommunications
networks. The digital signal received by the destination
(in the broad sense) is converted back into analog so
that it can be heard by means of a digital-to-analog
converter (DAC) driven by a clock at the same frequency
as that used by the ADC of the source. Under such
conditions, the whole system is entirely synchronous, as
generally applies to present-day switching and
transmission systems. These can include data rate
reduction systems (for example for a telephone signal,
for converting from 64 kilobits per second (kbit/s) to
32 kbit/s or 16 kbit/s or 8 kbit/s). It is the network
(or terminal systems as in the case of the integrated 
services digital network (ISDN) for example) which 
undertakes the operations of ADC, of encoding and 
decoding (where encoding and decoding are used in the 
context of reducing data rate), and of DAC. The clocks 
are always distributed and the system comprising ADC, 
speech encoding, transmission and switching, speech 
decoding, and finally DAC is fully isochronous. There 
are no losses or repeats of speech samples in the 
decoder. 

The above-described synchronous transmission systems 
require the presence of a reference clock throughout the 
network. Transmission systems are now making greater and 
greater use (initially for data) of asynchronous and 
packet techniques (IP protocol, asynchronous transfer
mode (ATM)). In numerous new situations, the decoder has
no reference concerning the sampling frequency used by 
the encoder and it must be capable, using its own means, 
of reconstituting a decoding clock which attempts to 
track the reference of the encoder. The present 
invention is thus particularly advantageous in frame 
relay telephone systems, in ATM telephony, or in IP 
telephony. The technique described can easily be used in 
other fields of speech or sound transmission in which 
there exists no effective transmission of the clock 
reference from the encoder to the decoder.

DESCRIPTION OF THE STATE OF THE ART

The general problem

The general problem posed by transmission systems to 
which the invention applies is that of mitigating the 
fact that the speech or sound decoder has no clock 
reference associated with the source encoding. 

In this respect, two circumstances can be 
distinguished: those corresponding to "weak" asynchrony 
and those corresponding to "strong" asynchrony. 



Weak asynchrony 

As an illustration, we consider the case of a 
transmission system comprising the following, as shown 
diagrammatically in Figure 1: 

- an encoding source 1 comprising an analog-to-digital
converter driven by a reference clock at frequency F_ADC
equal to 8 kilohertz (kHz) (to provide numerical values in
the worked examples below) and a speech encoder (of
greater or lesser complexity and reducing the
transmission data rate to a greater or lesser extent);

- an asynchronous transmission system (represented
by link 2) which conveys the information produced by the
encoding source using its own transmission clock and its
own protocols (for example the speech encoder could
produce a data rate of 8 kbit/s and the transmission
system could be constituted by an RS-232 type
asynchronous link operating at 9600 bit/s); and

- a reception and decoding system 3 receiving the
information conveyed over the asynchronous link (whose
data rate must necessarily be a little greater than the
raw encoding data rate, e.g. 9600 bit/s instead of
8000 bit/s) having the function of producing the signal
after decoding (decompression) and applying the signal it
produces to a digital-to-analog converter connected to a
transducer such as a loudspeaker, a telephone handset, a
headset, or a sound card installed in a personal computer
(PC).

It will be understood that since the reception and 
decoding system 3 has no clock reference, it must 
implement a strategy in order to mitigate this asynchrony
between the encoder and the decoder. 

Whatever the encoding technique used or the type of
transmission which does not directly convey a clock, or
time markers within the transmitted frame, or indications
concerning transmission instants, the above-mentioned
problem can be reduced (ignoring the speech encoder, the
asynchronous transmission system, and the speech decoder)
to a system comprising the following, as shown in
Figure 2:

- an analog-to-digital converter 4 for converting
speech signals or sound from analog to digital form at a
sampling frequency set by a local oscillator;

- a digital-to-analog converter 5 for playing back
the sound via a transducer suitable for the field of use
in question, at a sampling frequency given
by a local oscillator which, a priori, is at the same
frequency but which is never at exactly the same
frequency for reasons of acceptable manufacturing cost
(highly stable and very accurate frequency sources do
indeed exist, but they need to be temperature-compensated
and they are unacceptably expensive for mass-produced
industrial implementation); and

- a digital register 6 into which the analog-to-digital
converter 4 writes at its own sampling frequency (F_ADC),
said register being read at the sampling frequency (F_DAC)
of the playback system by the digital-to-analog converter
(DAC).

It will be understood that since the two clock
frequencies (F_ADC and F_DAC) are different, it is necessary
from time to time for the DAC to reread the same
information twice over (if F_DAC is greater than F_ADC) or on
the contrary (where F_DAC is less than F_ADC) to allow the
ADC to overwrite information before the DAC can read it.
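
As a rough illustration of the register model of Figure 2, the following Python sketch counts how often the DAC misses or rereads a sample. The function name, the use of integer frequency ratios, and the assumption that both converters start at time zero are illustrative choices and are not taken from the original description.

```python
def count_slips(f_adc, f_dac, n_reads):
    """Count samples skipped and repeated by a DAC reading a one-sample register.

    f_adc and f_dac are integers in any common unit (only their ratio
    matters). Hypothetical model: the ADC writes sample k at time k / f_adc,
    and the DAC, reading at times j / f_dac, sees the latest sample written.
    """
    skipped = 0
    repeated = 0
    previous = 0  # index of the sample seen at read 0
    for j in range(1, n_reads):
        current = (j * f_adc) // f_dac  # latest sample written at read j
        step = current - previous
        if step == 0:
            repeated += 1          # DAC rereads the same sample
        elif step > 1:
            skipped += step - 1    # ADC overwrote samples before they were read
        previous = current
    return skipped, repeated


# 8 kHz nominal, oscillators off by +50 ppm and -50 ppm, about 10 s of playback.
fast, slow = 1_000_050, 999_950
print(count_slips(fast, slow, 80_000))  # ADC faster than DAC: skips
print(count_slips(slow, fast, 80_000))  # DAC faster than ADC: repeats
```

Integer arithmetic is used so that the count does not depend on floating-point rounding.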

The oscillators that are commonly available in the
trade are characterized by the accuracy with which they
operate (within a certain temperature range).

Oscillators that are accurate to within 50 parts per
million (ppm) are quite commonly available and are used
to provide numerical values for the worked examples below
showing how frequently samples are lost or repeated when
the sampling frequency is 8 kHz (the reader can easily
determine that at higher sampling frequencies samples are
skipped or repeated at a frequency which is proportional
to the sampling frequency; the higher the sampling
frequency the higher the frequency at which samples are
skipped or repeated).

Under the least favorable conditions, an ADC is
operating at 8000 × (1 + 50e-6) Hz in association with a DAC
operating at 8000 × (1 - 50e-6) Hz. In this particular
example, the skip period (period for samples being
omitted in the DAC since F_DAC is less than F_ADC) is easily
calculated by counting the number N of periods of the DAC
(where the period is longer than that of the ADC) that
produces a value equal to said period of the DAC when
multiplied by the difference between the periods.

Writing the period of the DAC as P_DAC (in this case
1/(8000 × (1 - 50e-6)) s) and P_ADC the period of the ADC
(in this case 1/(8000 × (1 + 50e-6)) s), we obtain
N × (P_DAC - P_ADC) = P_DAC, where N represents the number
of individual operations that stem from the period
difference. Writing ε = 50e-6 and applying the
simplifications that are common for small quantities, we
obtain N = 1/(2ε), i.e. 10,000 sample periods. At 8 kHz,
that immediately gives the skip period as
being close to 1.25 seconds (s). If the accuracy of the
local oscillators is improved (e.g. by going from 50e-6
to 5e-6) then the skip period will increase (in this
case there will be one skip every 12.5 s).



In a complete transmission system including audio 
encoders operating on signal frames, this phenomenon of 
"slip" between two clocks will give rise to an absence of 
speech frames (no frame to be decoded in the time 
available for decoding) or to overabundance of frames 
(i.e. two frames for decoding instead of one in the 
available time). Taking the example of a speech encoder
operating on 30 millisecond (ms) frames at 8 kHz, i.e.
240 samples, in each 30 ms time slot the receiver and
more particularly the decoder expects to receive one
frame for decoding in order to ensure that playback of
the speech signal remains continuous. Unfortunately, if
F_ADC is less than F_DAC, then on the above assumptions,
there will be an absence of any frame of samples for
decoding by the sound playback system once every 240 ×
1.25 = 300 s, and in the converse situation there will be
two frames instead of one (i.e. a frame to be
"eliminated") at the decoder once every 300 s. Under
such circumstances, the awkward phenomenon of samples
being skipped or repeated becomes very disagreeable since
an entire block of the signal is skipped or repeated, and
this needs to be managed appropriately.
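
The worked figures above (1.25 s, 12.5 s and 300 s) can be checked with a short Python sketch. It applies the small-quantity approximation N = 1/(2ε) for two oscillators with opposite errors; the function names and the choice of expressing the tolerance in ppm are assumptions made for the illustration.

```python
def skip_period_seconds(sample_rate_hz, tolerance_ppm):
    """Approximate time between two single-sample slips.

    Uses N = 1 / (2 * eps) sample periods, eps being the oscillator
    tolerance, for the worst case of opposite errors on the ADC and DAC.
    """
    n_samples = 1_000_000 / (2 * tolerance_ppm)
    return n_samples / sample_rate_hz


def frame_slip_period_seconds(sample_rate_hz, tolerance_ppm, frame_ms):
    """Approximate time for the slip to accumulate to one whole frame."""
    samples_per_frame = sample_rate_hz * frame_ms / 1000
    return samples_per_frame * skip_period_seconds(sample_rate_hz, tolerance_ppm)


print(skip_period_seconds(8000, 50))            # sample-level slip, 50 ppm
print(skip_period_seconds(8000, 5))             # sample-level slip, 5 ppm
print(frame_slip_period_seconds(8000, 50, 30))  # one 30 ms frame, 50 ppm
```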

Strong asynchrony

Certain types of transmission amplify this problem 
of asynchrony due to the phenomenon of "slip" between 
clocks as explained above. This is what we refer to 
herein as "strong" asynchrony. 

When transmission is imperfect, giving rise to 
samples or frames of samples being lost and also when 
transmission generates jitter on sample arrival times, 
where such jitter is associated neither with the sending 
clock nor with the receiving clock, but is associated 
with other mechanisms in the transmission system having 
their own clocks, then the receiver system can be 
confronted with an absence of several frames, or with an 
overabundance of several frames. This can apply, for
example, with IP type networks which suffer from the
phenomenon of packets being lost and from the phenomenon
of jitter introduced during packet routing. These
phenomena disturb the continuity of the sound playback of
the audio signal very strongly. When packets are lost or
when jitter delays one or more packets, the playback
system finds itself without any sample (or frame of
samples) to apply to the DAC for the purpose of ensuring
continuity in audio playback. Conversely, when jitter is
strong, the playback system can find itself with far too
many frames or samples to be sent simultaneously to the
DAC. When jitter is strong, sound signal frame
transmission can take place in the form of bursts, thus
creating phenomena both of gaps and of overabundance
amongst sample frames.

It will be observed that using speech encoders
operating with a system of the voice activity
detector/discontinuous transmission/comfort noise
generation (VAD/DTX/CNG) type, a mechanism is also
introduced that is similar to the loss of a packet since
in the event of silence, the sender will cease to send
frames of samples. Ceasing to send samples can be
perceived at the receiver as being the same as the loss
of a packet or as circumstances in which the ADC clock is
slower than the DAC clock, which leads to holes in the
signal at the receiver, as shown above.

"Strong" asynchrony thus differs from "weak"
asynchrony by involving not only cyclical skips and/or
repetitions, but also holes in the signal and/or
overabundance of the signal in multiple and non-cyclic
manner.

Description of various existing methods 

Two main methods are presently known for mitigating 
the drawbacks due to the fact that the speech or sound
decoder has no clock reference.

The first consists merely in proceeding as described
above in the paragraph describing "weak" asynchrony, i.e.
by skipping or repeating samples. The decoding system
produces samples at a rate that is more or less equal to
that of the encoder and it presents them to the digital-
to-analog converter at said rate (means for implementing
the above reconstruction system are known to the person
skilled in the art). In some cases, for example when
"strong" asynchrony applies with transmission being in
the form of frames, it is preferable when samples for
playing back are missing to send null sample frames to
the DAC rather than repeating a preceding frame.
Furthermore, in the converse situation, when surplus 
samples are present, they are not eliminated directly, 
but a first-in-first-out (FIFO) register of some size can 
be used to absorb jitter to some extent. If the FIFO 
register becomes too full, then that triggers partial or 
complete emptying of the FIFO, thereby giving rise to new 
skips in sound playback. 
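
A minimal Python sketch of this first, prior-art method is given below for illustration only. The class name, the frame length, the capacity and the number of frames kept after a flush are all assumed values; the text does not specify them.

```python
from collections import deque


class SimpleJitterFifo:
    """Prior-art style buffer: null frame on underrun, flush on overflow."""

    def __init__(self, frame_len=240, max_frames=4, keep_after_flush=1):
        self.frame_len = frame_len
        self.max_frames = max_frames
        self.keep_after_flush = keep_after_flush
        self.frames = deque()
        self.flushes = 0
        self.underruns = 0

    def push(self, frame):
        """Store a decoded frame; partially empty the FIFO if it overfills."""
        self.frames.append(frame)
        if len(self.frames) > self.max_frames:
            # Partial emptying: the oldest frames are dropped (audible skip).
            while len(self.frames) > self.keep_after_flush:
                self.frames.popleft()
            self.flushes += 1

    def pop(self):
        """Return the next frame for the DAC, or a null frame if none is left."""
        if self.frames:
            return self.frames.popleft()
        self.underruns += 1
        return [0] * self.frame_len  # null frame rather than a repeated frame
```

Both counters make the drawback visible: every flush and every underrun corresponds to a discontinuity heard by the listener.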

The second method, which is more complicated and
provides better performance, requires a loop to be
implemented to recover a hardware clock which is servo-
controlled by the filling level of a buffer memory for
the signal to be decoded (or to be transmitted as in the
ATM adaptation layer number 1 (AAL1) for example). That
method of servo-control attempts to use the clock
recovery loop to recover the sampling frequency of the
source. The filling level of the receive buffer produces
a control signal for servo-controlling a digital or
analog phase-locked loop (PLL).
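
The principle of servo-controlling a clock from a buffer filling level can be sketched in Python as follows. This is only a proportional control law simulated in software, not a real phase-locked loop; the function name, the gain kp, the target level and the time step are assumptions chosen for the illustration.

```python
def servo_clock(f_src, f_nominal, target_level, kp, dt, steps):
    """Simulate a playback clock steered by the buffer filling level.

    f_src        -- true source sampling frequency (Hz), unknown to the receiver
    f_nominal    -- free-running frequency of the local oscillator (Hz)
    target_level -- desired buffer filling level (samples)
    kp           -- proportional gain (Hz per sample of level error)
    dt           -- simulation step (s)
    Returns the final playback frequency and buffer level.
    """
    level = float(target_level)
    f_out = f_nominal
    for _ in range(steps):
        # Control signal derived from the filling level.
        f_out = f_nominal + kp * (level - target_level)
        # The buffer fills at the source rate and drains at the playback rate.
        level += (f_src - f_out) * dt
    return f_out, level


f_out, level = servo_clock(8000.4, 7999.6, 480, 0.1, 0.03, 3000)
print(round(f_out, 3), round(level, 2))
```

With a purely proportional law the playback frequency converges toward the source frequency while the buffer settles at a level offset from the target, which is why practical loops usually add further filtering.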

The first method is extremely simple to implement 
but suffers from a major defect associated with the 
quality of the sound reproduced. A skip or elimination 
once every 1.25 seconds can be very disagreeable to 
listen to, and this can occur with "weak" asynchrony 
associated with correction at sample level. Similarly, 
for a system operating with frames of samples, the
inserted repetitions or blanks, and the discontinuities
in the signal due to frames being eliminated amplify loss 
of quality which becomes highly perceptible and very 
disturbing for the listener. 

Furthermore, the use of a FIFO memory runs the risk
of establishing a considerable delay in transmission and 
that also harms the overall quality of a call. 

The second method is much more complex to implement 
and requires a clock servo-control mechanism, and thus 
requires special hardware. However, it provides partial 
synchronization and therefore avoids problems associated 
with managing asynchrony. Nevertheless, that method 
adapts poorly to discontinuous transmission systems, to 
systems involving lost frames, or to systems with high
levels of jitter. Under such circumstances, 
synchronization information is no longer available. 
Furthermore, that method cannot be envisaged on terminal 
platforms where clock servo-control is not possible, as 
is the case in particular with PC type terminals, for 
example, where the system used for playing back sound is 
a sound card. 

Devices are already known from document WO 99/17584
for implementing a method in accordance with the preamble 
of claim 1, the devices having only one buffer memory. 

Document US-A 4 703 477 facilitates reading voice 
data by implementing a method of putting frames relating 
to the same voice data end-to-end. 

SUMMARY OF THE INVENTION 

A general object of the invention is to propose a 
solution to the problems associated with continuity in 
the playback of a speech signal in the presence of 
asynchronous transmission, and to do so by taking action 
at receiver level, i.e. at the end of the transmission 
system. 

To this end, the invention provides a method of 
managing the decoding and playback of a sound signal in
an asynchronous transmission system, in which any
overabundance of filling of a first buffer memory and/or 
of a second buffer memory situated at the inlet or at the 
outlet of a decoding block is detected by comparing the 
filling level with at least one threshold, the method 
being characterized in that, depending on the value of 
the filling level: 

- voice activity detection is implemented and frames 
considered by said detection as being non-active are 
eliminated; and

- concatenation processing is implemented on two
successive frames to compact them into a pseudo-frame of
length less than or equal to one frame, the length
reduction ratio of the pseudo-frame relative to the
length of the two frames being greater than or equal to 
two. 

Such a method is simple to implement and provides a 
guarantee of quality by avoiding excessive increase in 
transmission delay and by managing holes in the speech 
signal effectively. Furthermore, it does not imply any 
specific hardware servo-control circuit, and can 
therefore be quickly adapted to different asynchronous 
networks, terminals, and platforms. 
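
The way the filling level selects between the two kinds of processing can be sketched as follows in Python. The sketch assumes three ascending thresholds delimiting a voice-activity range and a concatenation range; the function name, the returned labels and the handling of levels above the third threshold are illustrative assumptions, not statements of the claimed method.

```python
def choose_action(fill_level, t1, t2, t3):
    """Select the processing to apply for a given buffer filling level.

    t1, t2, t3 are hypothetical thresholds in ascending order, expressed
    in the same unit as fill_level (frames or samples).
    """
    if not t1 <= t2 <= t3:
        raise ValueError("thresholds must be in ascending order")
    if fill_level < t1:
        return "decode_normally"          # no overabundance detected
    if fill_level < t2:
        return "drop_inactive_frames"     # voice activity detection range
    if fill_level < t3:
        return "concatenate_frame_pairs"  # two frames -> one pseudo-frame
    return "above_third_threshold"        # not covered by this passage


for level in (1, 3, 5, 9):
    print(level, choose_action(level, t1=2, t2=4, t3=8))
```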

The method is advantageously associated with the 
various characteristics below taken singly or in any 
technically feasible combination: 

- voice activity detection is implemented and frames 
considered by said detection as being not active are 
eliminated whenever the filling level lies between a 
first threshold and a second threshold, and
concatenation processing is implemented on two successive
frames whenever the filling level lies between the second
threshold and a third threshold;

- the first and second thresholds are the same; 

- detection is performed at the inlet or the outlet 
of a decoding block having a first buffer memory at its 
inlet and/or its outlet to determine whether any frame is
missing or erroneous or whether any samples to be played
back are absent, and a fake frame is generated to ensure 
continuity in the audio playback on detecting such a 
missing or erroneous frame, or on detecting such an 
absence of samples for playback; 

- when the decoding block implements its decoding 
processing in cyclical manner relative to the content of 
the first buffer memory, detection of any missing or 
erroneous frame or of any absence of samples to play back
is implemented at the same cyclical frequency, said
detection taking place far enough in advance relative to 
the decoding process to make it possible to generate a 
fake frame in good time; 

- a fake frame is not generated when a missing or 
erroneous frame is detected for a frame on which an 
absence of samples has already been detected; 

- for a system of the type which can voluntarily 
stop sending frames, the type of the previously-generated 
frame is stored from one frame to the next, and this 
information is used to determine whether to generate fake 
frames or to generate frames of silence; 

- in processing for concatenating two successive 
frames, the samples are weighted in such a manner as to 
give more importance to the first samples of the first 
frame and to the last samples of the second frame; 

- the threshold(s) is/are adaptive; and

- a threshold is adapted as a function of the length
of time passed with a filling level above a given
threshold.

The invention also provides a device for playing 
back a speech signal, the device comprising a first 
buffer memory receiving coded frames, means implementing 
decoding processing on the frames stored in said first 
buffer memory, a second buffer memory receiving decoded 
frames output by the decoding means, and sound playback 
means receiving the frames output by the second buffer 
memory, the device being characterized in that it further 
comprises means for implementing the above-specified
method.

As will be understood on reading the following 
description, these means are essentially computer means. 
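
By way of illustration of such computer means, the weighting described above for concatenating two successive frames can be sketched in Python. The sketch assumes equal-length frames and a linear weighting window; the summary only states that the first samples of the first frame and the last samples of the second frame are given more importance, so the exact window is an assumption, as is the function name.

```python
def concatenate_frames(first, second):
    """Merge two equal-length frames into one pseudo-frame of that length.

    The weight of the first frame falls linearly from 1 to 0 across the
    pseudo-frame, and the weight of the second frame rises from 0 to 1,
    so the result starts like `first` and ends like `second`.
    """
    if len(first) != len(second):
        raise ValueError("frames must have the same length")
    n = len(first)
    pseudo_frame = []
    for i in range(n):
        w = 1.0 if n == 1 else 1.0 - i / (n - 1)
        pseudo_frame.append(w * first[i] + (1.0 - w) * second[i])
    return pseudo_frame


print(concatenate_frames([10.0] * 5, [0.0] * 5))
```

The pseudo-frame has the length of one frame, so the pair of input frames is shortened by a factor of two.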

DESCRIPTION OF THE FIGURES 

Other characteristics and advantages of the 
invention appear further from the following description 
which is purely illustrative and non-limiting and which
should be read with reference to the accompanying
figures, in which:

- Figure 1 is a block diagram of an asynchronous
transmission system;

- Figure 2 is a diagram showing a model of such a
transmission system;

- Figure 3 is a diagram of a receiver device; and

- Figure 4 shows the signals obtained by
implementing concatenation processing as proposed by the
invention.

DETAILED DESCRIPTION OF ONE OR MORE EMBODIMENTS 

The method proposed by the invention for managing 
asynchrony in transmission implements two kinds of 
processing corresponding to handling the two phenomena
described above, namely the lack of samples and surplus
samples.

1. Description of the sound playback system in a
conventional transmission application

As shown in Figure 3, the playback system for a 
speech signal comprises three elements: 

- A block 10 waiting to receive samples or frames of 
code coming from the network. The block 10 contains a
FIFO type memory 11 or circular type buffer memory
(referred to as "FIFO 1" in the description below)
enabling frames to be stored on a temporary basis prior
to being decoded.

- A decoding block 12 which takes the frames coming
from the block 10, decodes them, and stores them in turn
in a FIFO memory 13 (referred to below as "FIFO 2").

- A playback block 14 which takes the decoded sample
frames and applies them to any kind of sound playback
system 15.

Depending on the terminals and the way the system is
organized, the clock frequency used for sound playback
(i.e. the digital-to-analog converter frequency F_DAC) is
not necessarily directly associated with all of the
blocks. Since the block 14 is directly associated with
the playback system, it is directly associated with the
frequency F_DAC. However the other blocks can be
associated instead with the rate at which frames arrive
from the network rather than with the frequency F_DAC.
Taking the example of a terminal provided with a 
multitasking system, and in which each block is performed 
by a specific task, the tasks 10 and 12 can thus be 
associated with frame reception. The task 10 waits for a 
frame from a network, which frame is then decoded by the 
task 12 and placed in the memory FIFO 2. 

Meanwhile the task 14 clocked at F_DAC takes samples
from the memory FIFO 2 and delivers them continuously to 
the sound playback system. 
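
The three-block organization of Figure 3 can be pictured with the following Python sketch. The class and method names and the decode callback are hypothetical; in a real terminal each method would run in its own task, the first two paced by frame arrival and the last one by the playback clock.

```python
from collections import deque


class Receiver:
    """Skeleton of blocks 10, 12 and 14 with their two FIFO memories."""

    def __init__(self, decode):
        self.decode = decode   # hypothetical codec callback: coded -> samples
        self.fifo1 = deque()   # FIFO 1 (memory 11): coded frames
        self.fifo2 = deque()   # FIFO 2 (memory 13): decoded frames

    def on_frame_received(self, coded_frame):
        """Block 10: store a frame arriving from the network."""
        self.fifo1.append(coded_frame)

    def run_decoder(self):
        """Block 12: decode everything waiting in FIFO 1 into FIFO 2."""
        while self.fifo1:
            self.fifo2.append(self.decode(self.fifo1.popleft()))

    def next_frame_for_playback(self):
        """Block 14: called at the playback rate; None means FIFO 2 is empty."""
        return self.fifo2.popleft() if self.fifo2 else None
```

The `None` return marks the situation that the rest of the description is concerned with: the playback side asks for samples and FIFO 2 has none to give.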

It can thus be seen that regardless of whether the 
asynchrony is "strong" or "weak", it is the way in which 
the memory FIFO 2 is managed that requires particular 
attention. Similarly, if the task 12 were strongly 
associated with the task 14, then particular attention 
would be required by the memory FIFO 1. 

The mechanism constituting an implementation of the 
invention is described below in application to managing 
the memory FIFO 2, but the description includes 
explanations about how to transpose it with certain 
adaptations to managing the memory FIFO 1.

2. Absence of samples

In order to continue playing back sound in the 
absence of samples, both potential causes of samples for 
playback being absent are treated. The first cause 
corresponds to information contained in lost packets, 
while the second cause corresponds to the absence of any 
samples to play back (e.g. FIFO 2 empty) even though it 
is still necessary to keep on sending samples to the 
sound playback system. 



2.1 Loss of frames or erroneous frames 

The processing applied to lost frames or to 
erroneous frames requires a transmission system to be 
available that gives access to information about frames 
being lost and about erroneous frames being received.
This is often the case in transmission systems.

For example, in IP networks, it is possible to use
the marking of packets coming from the real-time transport
protocol (RTP) layer, which marking gives the exact number
of samples lost between two packets of audio code being
received. This information about loss of frames, or in
the case of IP about loss of packets (each containing one
or more speech frames) generally becomes available only
once the packet following the lost packet(s) is itself
received.
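
As an illustration of how such marking can be exploited, the Python sketch below derives the number of lost packets from the 16-bit RTP sequence numbers of two consecutively received packets. It assumes in-order delivery without duplicates and a fixed number of samples per packet; the function names and that fixed payload size are assumptions of the sketch.

```python
RTP_SEQ_MODULO = 1 << 16  # RTP sequence numbers are 16 bits and wrap around


def packets_lost(prev_seq, new_seq):
    """Packets missing between two packets received one after the other."""
    return (new_seq - prev_seq - 1) % RTP_SEQ_MODULO


def samples_lost(prev_seq, new_seq, samples_per_packet):
    """Samples missing, assuming every packet carries the same payload size."""
    return packets_lost(prev_seq, new_seq) * samples_per_packet


print(packets_lost(10, 14))       # packets 11, 12 and 13 are missing
print(packets_lost(65534, 1))     # wrap-around: 65535 and 0 are missing
print(samples_lost(10, 14, 240))  # with one 30 ms frame at 8 kHz per packet
```

As the text notes, the result is only known once the packet following the gap has arrived.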

It is not necessarily advantageous to take action,
while one or more valid frames can be decoded. With new
generation speech encoders (CELP encoders, transform
encoders, ...), in order to ensure that the quality of
sound playback is maintained, it is often necessary to
ensure a degree of synchronism between the encoder and
the decoder. The loss of this encoder/decoder
synchronism can be compensated by using frame loss
correction algorithms associated with the speech encoder
used. By way of example, these algorithms are provided
in the standards for certain speech encoders (e.g.
International Telecommunication Union (ITU) standard
G.723.1). When using simpler encoders, such a mechanism
is not always necessary.

When a large number of frames have been lost, the
number of "fake" sample frames that need to be generated
in order to pack out the memory FIFO 2 can be limited.
The purpose of fake frame generation is to
fill holes in such a manner as to ensure signal
continuity while also smoothing the internal variables of
the decoder so as to avoid excessive divergence on
decoding the first valid frame following the invalid or
lost frames, thereby avoiding any audible discontinuity.
After a few frames have been generated, it can be assumed 
that the variables have been smoothed, and thus that the 
generation of such fake frames can be limited to a small 
number of frames (e.g. four to six) whenever a large 
number of frames has been lost. 
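
The cap on the number of fake frames can be sketched as follows in Python. The concealment used here (repeating the last good frame with progressive attenuation) is only a stand-in for the frame loss correction algorithm of the codec actually used; the function name, the default cap of five frames and the decay factor are assumptions.

```python
def generate_fake_frames(last_good_frame, n_lost, max_fake=5, decay=0.5):
    """Generate at most max_fake concealment frames for n_lost lost frames.

    Placeholder concealment: each fake frame is the previous one scaled
    by `decay`. A real system would call the codec's own algorithm.
    """
    fakes = []
    frame = list(last_good_frame)
    for _ in range(min(n_lost, max_fake)):
        frame = [sample * decay for sample in frame]
        fakes.append(frame)
    return fakes


print(len(generate_fake_frames([8.0] * 4, n_lost=20)))  # capped at max_fake
print(generate_fake_frames([8.0], n_lost=2))
```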

It will thus be understood that processing is servo-
controlled in this way relative to information about lost
frames.

Similar processing is implemented on the basis of 
information about invalid frames. This information is 
forwarded to the decoder by the network portion of the 
receiver and it arrives soon enough to enable a frame 
correction algorithm to be implemented which, by taking 
account of such a non-valid frame, makes it possible to
ensure continuity in the signal, and thus to avoid having 
another cause for samples being absent in the memory 
FIFO 2. 

To sum up, this first process corresponds to 
managing information of the type "n frames lost" or 
"invalid frame received" coming from the network layer of 
the receiver. This management is characterized by 
implementing an algorithm for correcting frame losses 
(also referred to in this document as an algorithm for 
generating "fake" frames). This first process therefore
acts at decoding task level and feeds the memory FIFO 2. 

2.2 Absence of samples to be played back 

This second process is associated with the clock 
coming from the task 14, i.e. with the clock at the 
frequency F_DAC. As mentioned above, the memory FIFO 2 (or
FIFO 1 if the task 12 is included in the task 14) can 
become empty of samples even though it is still necessary 
to supply samples to the sound playback system. It is 
then necessary to supply the playback system with samples 
and if possible to avoid playing back zeros (since this 
degrades the sound signal very greatly).

This second process can be thought of as a feedback
loop on frame decoding. This loop causes the algorithm 
for correcting frame losses to be called and as a result 
it needs to be activated soon enough to enable the 
algorithm to be executed and to enable samples to be sent 
to the sound playback system. Depending on the platform, 
this feedback can be called in different ways. 

This loop can be implemented in two ways which are 
described below. 

For a single-task receiver (e.g. a digital signal 
processor (DSP) without any real time operating system
(RTOS)), the audio decoder portion is tied completely to
the DAC clock (F_DAC) and is therefore permanently waiting
for a frame to be decoded in cyclical manner. For
example, with a speech encoder using 30 ms frames,
waiting loops are built up in periods that are multiples
of 30 ms.

Thus, for a 30 ms loop, the decoder will, every 
30 ms, be expecting a frame for decoding to be placed in
the memory FIFO 1 (which can correspond merely to a frame 
passing from the network layer to the task 12) . On 
arrival of the frame, it is decoded and placed in the 
memory FIFO 2 for sending to the DAC. The feedback 
processing is implemented whenever it is observed that 
there is no frame for decoding in the memory FIFO 1 at 
the time given by: 

T = To + 30 ms - Tc 

where:

To = the start time of the 30 ms wait loop; and
Tc = the time required for executing the algorithm 
for generating fake frames with a safety margin 
corresponding to interrupts and/or other auxiliary 
processing that might take place before the end of the 
loop. 

Processing is thus implemented with a latency time 
deadline of Tb (loop time) - Tc (computation time + 
margin).
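
The deadline test T = To + 30 ms - Tc can be written as a short Python sketch. Times are kept as integer milliseconds to avoid rounding effects; the function name and parameter names are assumptions made for the illustration.

```python
def fake_frame_needed(now_ms, loop_start_ms, tc_ms, fifo1_frames, loop_ms=30):
    """Decide whether the feedback processing must be triggered.

    now_ms        -- current time
    loop_start_ms -- To, start time of the current wait loop
    tc_ms         -- Tc, fake frame generation time plus safety margin
    fifo1_frames  -- number of frames waiting in FIFO 1
    loop_ms       -- loop duration (30 ms in the example of the text)
    """
    deadline_ms = loop_start_ms + loop_ms - tc_ms  # T = To + 30 ms - Tc
    return now_ms >= deadline_ms and fifo1_frames == 0


print(fake_frame_needed(now_ms=24, loop_start_ms=0, tc_ms=5, fifo1_frames=0))
print(fake_frame_needed(now_ms=25, loop_start_ms=0, tc_ms=5, fifo1_frames=0))
print(fake_frame_needed(now_ms=25, loop_start_ms=0, tc_ms=5, fifo1_frames=1))
```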



With a multitasking receiver (e.g. a PC terminal),
time is not managed with such precision and the
processing implemented must therefore be somewhat
different. (Note: this processing nevertheless remains
quite close to the preceding process since it too seeks
to take account of the time Tc.)

Under such circumstances, the only waiting loops 
available are often those associated with events, e.g. 
the fact that packets have been received by the network, 
or the fact that buffer memory n (containing one or more 
sample frames) sent previously to the sound playback 
system has been read by the DAC and is therefore again 
available for sending samples to the DAC. 

Depending on the structure of the system and on 
whether or not it is necessary to respond quickly to an 
event, it is possible to wait for a certain length of 
time before filling said buffer memory prior to 
forwarding to the DAC. Such a latency time is selected
in such a manner as to leave enough time for the
algorithm for generating fake frames to execute, if
necessary.

Then, possibly after said latency time has elapsed, 
the process verifies that sufficient samples are present 
in FIFO 2 (note: this could apply to FIFO 1 if the 
management takes place at its level), and if not it
requests an appropriate number of fake frames to be 
generated in order to fill buffer memory n. 

When the system is such that it is necessary to fill 
buffer memory n "immediately", then monitoring the 
availability of samples and possibly calling for the 
"fake" frames generation processing are implemented 
directly after each delivery to the DAC from the buffer 
memory so that the generated samples are already in the 
memory FIFO 2 when the event "buffer memory n available" 
occurs. 

Thus, whatever the receiver, the process observes 
the absence of samples to be sent to the sound playback 
system by implementing a check on the content of the 
buffer memory FIFO 2 (or FIFO 1 depending on how the 
sound playback system is managed) and activates the 
algorithm for generating "fake" frames in order to 
generate the missing samples. 
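
The check on FIFO 2 described above can be sketched as follows. This is a minimal illustration only: FIFO 2 is modelled as a `deque` of samples, and `make_fake_frame` is a hypothetical callback standing in for the fake-frame generation algorithm, whose details are not reproduced here.

```python
from collections import deque


def fill_playback_buffer(fifo2, buffer_size, frame_size, make_fake_frame):
    """Take buffer_size samples from FIFO 2, generating fake frames if samples are missing.

    Returns the block of samples for the DAC and the number of fake frames generated.
    """
    fake_frames = 0
    # Top up FIFO 2 with fake frames until one full buffer can be delivered.
    while len(fifo2) < buffer_size:
        fifo2.extend(make_fake_frame(frame_size))
        fake_frames += 1
    block = [fifo2.popleft() for _ in range(buffer_size)]
    return block, fake_frames
```

The returned count is the kind of information the first process would need in order to avoid generating the same frames again once the loss is detected.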

It will be understood that the second process 
responds firstly to the problem of "slip" between clocks, 
and more precisely to the circumstance in which the 
receive clock ($F_{DAC}$) is faster than the send clock ($F_{ADC}$). 
It also applies to the phenomenon of frames being lost 
since this can lead to there being an absence of samples 
to send to the DAC even before frame loss has been 
detected, since such detection occurs only on receiving 
the frame following the loss. 

In order to combine the actions of the first and 
second processes, the first process is prevented from 
generating "fake" frames on detection of frame loss 
whenever the corresponding frames have just been 
generated by the second process. 

For this purpose, use is also made of 
counters determining the number of samples that have been 
generated by the second process. 

2.3 Specific actions for speech encoders using 
VAD/DTX/CNG services 

Encoders using a VAD/DTX/CNG system can voluntarily 
stop sending frames; under such circumstances, the 
absence of samples must not be considered exactly as a 
loss of frames, but rather as a period of silence. The 
only way of determining whether the frame to be generated 
must be silence or should correspond to a lost frame is 
to know the type of the previously-generated frame (i.e. 
signal frame or frame corresponding to a lost frame, or a 
noise update frame (SID), or a frame of silence (NOT)). 
For this purpose, the type of the generated frame is 
stored, and while frames are being generated to 
compensate for an absent frame or a lost frame, it is 
decided whether fake frames should be generated using the 
algorithm for correcting frame losses (as applies when 
the preceding frame was of the FSF type), or whether 
frames of silence should be generated by activating the 
decoder appropriately (as applies when the preceding 
frame was of the SID or the NOT type). 
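
The decision rule above can be summarized in a few lines of Python. The enum and function names are hypothetical, and FSF is taken here to cover both signal frames and frames generated for a lost frame, which is an assumed reading of the text.

```python
from enum import Enum


class FrameType(Enum):
    FSF = "signal frame or frame generated for a lost frame"  # assumed meaning
    SID = "noise update frame"
    NOT = "frame of silence"


def fill_action(previous_type):
    """Choose how to fill a gap from the stored type of the previously generated frame."""
    if previous_type is FrameType.FSF:
        return "conceal"   # run the frame-loss correction algorithm
    return "silence"       # previous frame was SID or NOT: let the decoder produce silence
```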

3. Overabundance of samples to be played back 

In order to deal with an overabundance of samples to 
be played back, processing is implemented to empty out 
frames, eliminating certain frames in full or in part 
prior to their possibly being taken into account by the 
sound playback system. 

This processing enables frames to be stored in 
buffer memories until certain thresholds trigger actions 
for limiting the amount of frames in memory and for 
limiting any corresponding increase in delay across the 
communications system. This limited storage makes it 
possible to accommodate jitter phenomena on receiving 
frames in bursts and slip between clocks, while 
nevertheless limiting transmission delay. 

3.1 Emptying out processing 

Any accumulation of frames is initially detectable 
in the memory FIFO 1, and is subsequently transferred to 
the memory FIFO 2. 

The proposed method manages information concerning 
the filling level of a reference buffer memory, i.e. 
FIFO 1 or FIFO 2 depending on how the tasks 10, 12, and 
14 are organized in the receiver (as explained above). 
If the tasks 12 and 14 are associated with each other, 
then the filling level information used by the method 
relates to the memory FIFO 1 which acts as a buffer 
between the network and the sound playback system. 
Similarly, if the tasks 10 and 12 are associated, then it 
is the memory FIFO 2 which acts as a buffer and it is 
therefore its filling level which is taken into 
consideration by the management process. 

The process is explained below for the second 
context. The first is merely an immediate transposition 
thereof. 

In order to maintain synchronization as closely as 
possible between the encoder and the decoder, and thus 
maintain optimum sound playback, all of the frames coming 
from the network are decoded. The process then decides 
on what action to take on the decoded frame as a function 
of information concerning filling level. This action is 
described in greater detail below. To activate the 
processing, filling level thresholds are used. These 
thresholds define filling alarm levels for the FIFO 
memory. In order to take action that is as inaudible as 
possible (i.e. in order to limit quality degradation) two 
levels of action are selected. A first level (alarm 
level 1) corresponds to the filling level being excessive 
but not critical (far from the maximum acceptable filling 
level), while a second level (alarm level 2) corresponds 
to it being mandatory to take action on each frame (this 
level is quite close to the maximum acceptable level). A 
third or "emergency" level (alarm level 3) is also 
defined in order to avoid memory overflows or other 
problems. This level corresponds to filling being very 
close to the maximum acceptable level. Alarm level 3 
should never be reached if the actions taken at the two 
preceding threshold levels are properly performed and if 
the thresholds are properly defined. 

Each time decoding is performed, the information 
concerning filling level is compared with the thresholds 
in order to determine the state of the FIFO (in an alarm 
state or not), and, where appropriate, the level of the 
alarm. 

If the state obtained is not an alarm state, then no 
action is undertaken and the decoded frame is stored in 
FIFO 2. 

In the first alarm state, it is considered that at 
least 50% of the signal coming from a conversation is not 
useful and therefore at this alarm level, all frames 
presenting very little information are eliminated. For 
this purpose, it is possible to implement simple VAD 
which monitors all frames of samples after they have been 
decoded to decide whether or not they should be written 
into FIFO 2. The process can also make decisions based 
on information taken directly from the coded frame 
concerning the importance or otherwise of the information 
contained in the frame. In this alarm state, any frame 
that is considered as containing nothing but noise is 
simply not stored in FIFO 2 for future sound playback. 

In the second alarm state (critical level), it is 
necessary to take action on each frame to curb growth in 
the filling level of the memory FIFO 2 very aggressively. 
At this level, the preceding processing (i.e. the 
processing implemented for alarm level 1) remains active. 
However it is now also necessary to shorten pairs of 
consecutive frames into a pseudo-frame of one frame or 
shorter. A decision is therefore taken on the basis of 
two non-"silent" sample frames (given that any frame that 
is "silent" is merely not written to FIFO 2 as a result 
of alarm state 1 being already included in alarm state 
2). Action on two consecutive frames is therefore 
undertaken only when a frame is detected as being non- 
"silent". The frame is initially stored, and then if the 
next frame is "silent", then it is only the first frame 
that is written into FIFO 2. 

When both frames contain important information, it 
becomes necessary to replace them by a single frame while 
minimizing loss of information and degradation of 
quality. It is the replacement frame that is stored in 
FIFO 2. Any effective solution capable of performing 
this task can be used and activated under such conditions 
(i.e. second alarm state and two non-"silent" frames). 
Two examples of algorithms for performing this task are 
described below. 

In a first algorithmic solution, the two contiguous 
frames are replaced by a single frame in which each 
coefficient $x'_j$ (where $j$ lies in the range 0 to $N-1$ and 
where $N$ is the number of samples per frame) is given the 
value $(x_i + x_{i+1})/2$ with $i = 2j$ (where $i$ lies in the range 0 to $2N-1$, 
with the coefficients $x_i$ coming from both original 
frames). This solution amounts to performing a kind of 
smoothed undersampling. The frequency of the played-back 
signal is thus doubled for this frame. Nevertheless, the 
inventors have found that providing alarm state 2 does 
not occur very frequently, this solution suffices to 
maintain the quality of sound playback. 
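
A minimal Python sketch of this first solution is given below. It assumes that each output sample is the mean of one adjacent pair of input samples (pairs 0-1, 2-3, and so on), which is one reading of the "smoothed undersampling" described above; the function name is hypothetical.

```python
def merge_by_averaging(frame_a, frame_b):
    """Replace two contiguous N-sample frames by one N-sample frame of pairwise means."""
    if len(frame_a) != len(frame_b):
        raise ValueError("both frames must have the same length")
    x = list(frame_a) + list(frame_b)   # the 2N original samples
    n = len(frame_a)
    return [(x[2 * j] + x[2 * j + 1]) / 2 for j in range(n)]
```

For example, the frames [0, 2, 4, 6] and [8, 10, 12, 14] are replaced by the single frame [1.0, 5.0, 9.0, 13.0].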

In a second solution, signal amplitude is detected 
to enable the two frames to be compacted into a pseudo- 
frame of length shorter than or equal to that of one 
frame. The number of samples contained in the pseudo- 
frame is determined by the fundamental frequency or 
"pitch" information, but in all events it is shorter than 
or equal to the length of a normal frame, while 
nevertheless being close to said frame length. The 
algorithm used ensures continuity of the played-back 
signal without any audible hole, and without any 
frequency doubling, while nevertheless dividing the 
amount of storage required for the signal by a factor 
that is greater than or equal to 2. This is explained in 
greater detail in paragraph 3.4 below. Furthermore, this 
also minimizes the loss of sound information since less 
than 50% of the information is in fact eliminated. 

It should be observed that when the receiver 
implements its processing on the basis of analyzing 
FIFO 1, with the decoder being directly associated with 
the sound playback system, it is necessary to generate a 
number of samples that is sufficient, i.e. in the present 
case a number that ensures at least one frame of samples 
is made available in FIFO 2. The frame concatenation 
algorithm is then calibrated to ensure that it always 
generates a minimum number of samples, but at least one 
frame. Another solution would consist in activating the 
algorithm several times over instead of only once when 
that is necessary to obtain a sufficient number of 
samples. 

In the third alarm level, which is normally never 
reached, no frames are stored in FIFO 2. In a variant, 
the system can also decide to clear out a fraction of the 
buffer memory suddenly (this can apply where it is 
management of FIFO 1 that is activated). 

It should also be observed that depending on the 
network and on the types of problems it generates in 
terms of asynchrony, it is possible to decide whether or 
not to activate particular alarm levels. For example, 
when asynchrony is "weak" then alarm levels 1 and 2 can 
be combined, and the simple solution of replacing two 
frames by a single frame can then constitute the only 
active process. 

3.2 Alarm thresholds 

There follows a more detailed description of the 
alarm thresholds and how they are managed. 

As explained above, the reference memory is said to 
be in alarm state 1 when its filling level is above 
threshold 1; this state remains active until its filling 
level drops below threshold 0. State 1 therefore 
operates with hysteresis. 

The memory is said to be in alarm state 2 if the 
filling level exceeds threshold 2 and to be in alarm 
state 3 if the filling level exceeds threshold 3. It is 
possible to envisage managing these alarm states with 
hysteresis as well. 

Thresholds 0, 1, and 2 are adaptive. Threshold 3 is 
directly associated with the maximum acceptable size and 
it is fixed. These thresholds need to be adaptable in 
order to accommodate different call contexts and real-time 
fluctuations during a call. It is appropriate to 
allow a greater amount of delay when the call is 
suffering a large amount of jitter (where delaying 
playback remains the best way of ensuring acceptable 
quality in the presence of jitter). In a high-jitter 
context, it is therefore appropriate for the thresholds 
0, 1, and 2 to be at quite high levels. 

To facilitate processing, the positions of the 
thresholds can correspond to integer numbers of frame 
sizes as exchanged between the various tasks of the 
receiver. This frame size is written Tt. 

By way of example, the initial values of these 
thresholds can be as follows: 

Threshold 0: 5 x Tt 
Threshold 1: 8 x Tt 
Threshold 2: 12 x Tt 
Threshold 3: 24 x Tt (fixed value) 
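
The way a filling level maps onto an alarm state, including the hysteresis on state 1, can be sketched as follows. The function name and the representation of thresholds as a tuple of frame counts are assumptions made for illustration; the example values are the initial thresholds listed above.

```python
def alarm_level(fill, thresholds, previous_level):
    """Return the alarm state (0 to 3) for a filling level expressed in units of Tt.

    thresholds is (threshold 0, threshold 1, threshold 2, threshold 3).
    State 1 is entered above threshold 1 and left only below threshold 0.
    """
    t0, t1, t2, t3 = thresholds
    if fill > t3:
        return 3
    if fill > t2:
        return 2
    if fill > t1:
        return 1
    # Hysteresis: once in an alarm state, remain in state 1 until below threshold 0.
    if previous_level >= 1 and fill >= t0:
        return 1
    return 0
```

With thresholds (5, 8, 12, 24), a filling level of 6 x Tt gives state 1 if the memory was already in an alarm state, and state 0 otherwise.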

The thresholds 0, 1, and 2 can be adapted together 
in steps of size Tt. Extreme acceptable values can lie 
in the range -1 to +8 steps, for example. 
Thus, threshold 1 can take on values 7x, 8x, 9x, 10x, ..., 
16x Tt. Threshold adaptation proper is performed on 
the basis of an adaptation criterion which is the length 
of time spent in the alarm state. For this purpose, an 
alarm state percentage is evaluated about once every N 
seconds (e.g. N = 10). When this percentage exceeds a 
given threshold (5%), the alarm state thresholds are 
increased; otherwise, when this percentage is below a 
given minimum threshold (0.5%), the alarm thresholds are 
decreased. To ensure that the system does not oscillate 
excessively due to its thresholds being adapted too 
frequently, hysteresis is applied to adaptation decision 
making. The thresholds are actually increased by one 
step, only in the presence of two increase options that 
are consecutive, and they are decreased by one step, only 
in the presence of three decrease options that are 
consecutive. As a result, the length of time between two 
threshold increments is at least 2N seconds and the 
length of time between two threshold decrements is at 
least 3N seconds. The procedure for increasing 
thresholds can be accelerated if a large percentage of 
frames are in an alarm state. One accelerating procedure 
consists in increasing the thresholds directly, for 
example whenever the alarm percentage exceeds 50%. 
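
The adaptation rule can be sketched as a small Python class. The class and attribute names are hypothetical, the percentages (5%, 0.5%, 50%) and the range of -1 to +8 steps are the example values given above, and resetting the counters after each step is an assumption made so that two increments are at least 2N seconds apart.

```python
class ThresholdAdapter:
    """Tracks a common offset, in steps of Tt, applied to thresholds 0, 1 and 2."""

    def __init__(self, base=(5, 8, 12), min_offset=-1, max_offset=8):
        self.base = base
        self.min_offset = min_offset
        self.max_offset = max_offset
        self.offset = 0
        self.up_votes = 0      # consecutive evaluations asking for an increase
        self.down_votes = 0    # consecutive evaluations asking for a decrease

    def _step(self, delta):
        self.offset = max(self.min_offset, min(self.max_offset, self.offset + delta))
        self.up_votes = 0
        self.down_votes = 0

    def update(self, alarm_percent):
        """Call once every N seconds with the percentage of time spent in alarm."""
        if alarm_percent > 50.0:
            self._step(+1)                 # accelerated increase
        elif alarm_percent > 5.0:
            self.up_votes += 1
            self.down_votes = 0
            if self.up_votes >= 2:         # two consecutive increase decisions
                self._step(+1)
        elif alarm_percent < 0.5:
            self.down_votes += 1
            self.up_votes = 0
            if self.down_votes >= 3:       # three consecutive decrease decisions
                self._step(-1)
        else:
            self.up_votes = 0
            self.down_votes = 0
        return self.offset

    def thresholds(self):
        """Current thresholds 0, 1 and 2 in units of Tt (threshold 3 stays fixed)."""
        return tuple(b + self.offset for b in self.base)
```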

Naturally, the threshold values given for the alarm 
thresholds are provided purely by way of indication. 

3.3 Interaction with the first process 

The first process is the process which causes "fake" 
frames to be generated when frames are lost or erroneous. 
When the system is in an alarm state (overabundance of 
frames), there is no need to generate "fake" frames which 
would merely aggravate the phenomenon of overabundance. 
Nevertheless, in order to maintain high quality sound 
playback it is important to maintain encoder/decoder 
synchronization by informing the decoder whenever a frame 
has been lost (e.g. by launching the generation of one or 
two fake frames, but no more). The third process will 
act in the alarm state on the first process so as to curb 
very strongly the generation of "fake" frames. 

3.4 Frame concatenation 

The purpose of the concatenation process is to 
shorten the duration of a digital audio signal containing 
speech or music while introducing as little audible 
degradation as possible. Since the sampling frequency is 
given and fixed, it is the number of samples sent to the 
sound playback apparatus that is decreased. One obvious 
solution for shortening a sequence of N samples is to 
remove M regularly spaced apart samples from the sequence 
in question. This causes the fundamental frequency to 
increase and that can be unpleasant for the listener, 
particularly when the ratio M/N is large. Furthermore, 
there is a danger of no longer complying with the 
sampling theorem. The process described below makes it 
possible to shorten an audio sequence without modifying 
its fundamental frequency and without giving rise to 
audible degradation due to signal discontinuity. This 
process is based on detecting the value of the pitch 
period. The number of samples eliminated by this 
algorithm cannot be selected freely, since it is a 
multiple of the pitch value P. Nevertheless, it is 
possible to define a minimum number of samples to be 
eliminated $N_{e,min}$ which must satisfy the relationship 
$N_{e,min} \le N/2$. Since the purpose is to eliminate at least 50% of 
the samples in the context of the device for managing 
asynchrony in an audio transmission, it is advantageous 
to set $N_{e,min} = N/2$. It is also assumed that the maximum 
value of the pitch P is less than the length N of the 
sequence to be shortened. The number $N_e$ of samples that 
are eliminated by the algorithm is then the smallest 
multiple of the pitch value P that is greater than or 
equal to $N_{e,min}$, i.e. $N_e = kP$ where k is a positive integer 
and $N_e \ge N_{e,min} > N_e - P$. The length of the output signal 
is then $N_r = N - N_e$. The input signal to be shortened is 
written s(n), where n = 1, ..., N, and the output signal 
is written s'(n), where n = 1, ..., $N_r$. In order to ensure 
continuity in the output signal, the first and the last $N_r$ 
samples of the signal s(n) are merged progressively, i.e. 

$$s'(n) = s(N_e + n)\,w(n) + s(n)\,(1 - w(n)), \quad n = 1, \ldots, N_r,$$

where w(n) is a weighting function such that $0 \le w(n) \le 1$ 
for n = 1, ..., $N_r$ and $w(n) \le w(n+1)$ for n = 1, ..., 
$N_r - 1$. For example, w(n) can merely be the linear function 
$w(n) = n/N_r$. For an unvoiced signal where it is not 
possible to determine the pitch, $N_e$ can be fixed freely. 

Figure 4 showing signal sequences A, B, C, and D 
illustrates how the process is implemented on a worked 
example. The first sequence (A) is shown as a solid line 
and constitutes a piece of the signal s(n) to be 
shortened that is N = 640 samples long. The purpose is 
to shorten this sequence by at least 320 samples without 
changing its fundamental frequency, and without 
introducing any discontinuity or other audible 
degradation. The pitch of s(n) varies slowly, its value 
being equal to 49 at the beginning of the sequence and 45 
at the end of the sequence. The pitch detected by a 
correlation method is P = 47. Thus, s(n) will be 
shortened by k = 7 periods, i.e. $N_e = kP = 7 \times 47 = 329$ 
samples. 

In this example, linear weighting has been selected. 
The sequences B and C show two pieces of the signal of 
length $N_r = N - N_e = 311$ that have already been weighted 
and that are subsequently merged together. Merging is 
performed by adding these two signals together. In 
sequence C, it can be seen that because of the slight 
variation in pitch, these two pieces of the signal s(n) 
are somewhat phase-shifted. Because of the merging 
technique used, this does not give rise to any 
discontinuity in the output signal s'(n) (continuous line 
in sequence D). It can also be seen in sequence D that 
the shortened signal s'(n) remains properly in phase with 
the signals that precede it and that follow it (dashed 
line in Figures 1 and 4). 
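
The concatenation process of section 3.4 can be sketched in Python as follows. This is an illustrative sketch only: the pitch is assumed to be supplied by a separate detector, the linear weighting is used, the minimum number of eliminated samples is taken as half the input length, and the function names are hypothetical.

```python
import math


def samples_to_eliminate(n, pitch, n_e_min=None):
    """Smallest multiple of the pitch that is greater than or equal to n_e_min."""
    if n_e_min is None:
        n_e_min = n // 2               # assumption: eliminate at least half the samples
    k = max(1, math.ceil(n_e_min / pitch))
    return k * pitch


def shorten(signal, pitch):
    """Shorten a sequence by a whole number of pitch periods using a linear cross-fade."""
    n = len(signal)
    n_e = samples_to_eliminate(n, pitch)
    n_r = n - n_e                      # length of the output signal
    if n_r <= 0:
        raise ValueError("pitch too large for this sequence length")
    out = []
    for i in range(n_r):               # i corresponds to n - 1 in the 1-based notation
        w = (i + 1) / n_r              # linear weighting, rising to 1 at the last sample
        out.append(signal[n_e + i] * w + signal[i] * (1 - w))
    return out
```

With the values of the worked example (a sequence of 640 samples and a detected pitch of 47), `samples_to_eliminate` returns 329 and the output is 311 samples long. For a signal that is exactly periodic with the detected pitch, the cross-fade leaves the waveform unchanged, which is why no discontinuity is introduced.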

CLAIMS 

1/ A method of managing the decoding and playback (14) of 
a sound signal in an asynchronous transmission system, in 
which any overabundance of filling of a first buffer 
memory (11) and/or of a second buffer memory (13) 
situated at the inlet or at the outlet of a decoding 
block (12) is detected by comparing the filling level 
with at least one threshold, the method being 
characterized in that, depending on the value of the 
filling level: 

- voice activity detection is implemented and frames 
considered by said detection as being non-active are 
eliminated; and 

- concatenation processing is implemented on two 
successive frames to compact them into a pseudo-frame of 
length less than or equal to one frame, the length 
reduction ratio of the pseudo-frame relative to the 
length of the two frames being greater than or equal to 
two. 

2/ A method according to claim 1, characterized in that 
voice activity detection is implemented and frames 
considered by said detection as being not active are 
eliminated whenever the filling level lies between a 
first threshold and a second threshold, and in that 
concatenation processing is implemented on two successive 
frames whenever the filling level lies between a second 
threshold and a third threshold. 

3/ A method according to claim 2, characterized in that 
the first and second thresholds are the same. 

4/ A method according to any preceding claim, 
characterized in that detection is performed at the inlet 
or the outlet of a decoding block (12) having a first 
buffer memory (11) at its inlet and/or its outlet to 
determine whether any frame is missing or erroneous or 
whether any samples to be played back are absent, and a 
fake frame is generated to ensure continuity in the audio 
playback on detecting such a missing or erroneous frame, 
or on detecting such an absence of samples for playback. 

5/ A method according to claim 4, characterized in that 
when the decoding block (12) implements its decoding 
processing in cyclical manner relative to the content of 
the first buffer memory (11), detection of any missing or 
erroneous frame or of any absence of samples to play back 
is implemented at the same cyclical frequency, said 
detection taking place far enough in advance relative to 
the decoding process to make it possible to generate a 
fake frame in good time. 

6/ A method according to claim 4 or claim 5, 
characterized in that a fake frame is not generated when 
a missing or erroneous frame is detected for a frame on 
which an absence of samples has already been detected. 

7/ A method according to any one of claims 4 to 6, 
characterized in that, for a system of the type which can 
voluntarily stop sending frames, the type of the 
previously-generated frame is stored from one frame to 
the next, and this information is used to determine 
whether to generate fake frames or to generate frames of 
silence. 

8/ A method according to any preceding claim, 
characterized in that in processing for concatenating two 
successive frames, the samples are weighted in such a 
manner as to give more importance to the first samples of 
the first frame and to the last samples of the second 
frame. 

9/ A method according to any preceding claim, 
characterized in that the threshold(s) is/are adaptive. 

10/ A method according to claim 9, characterized in that 
a threshold is adapted as a function of the length of 
time passed with a filling level above a given threshold. 

11/ A device for playing back a speech signal, the device 
comprising a first buffer memory (11) receiving coded 
frames, means implementing decoding processing (12) on 
the frames stored in said first buffer memory (11), a 
second buffer memory (13) receiving decoded frames output 
by the decoding means, and sound playback means (14) 
receiving the frames output by the second buffer memory 
(13), the device being characterized in that it further 
comprises means for implementing the method according to 
any preceding claim. 

ABSTRACT 



A METHOD OF MANAGING THE DECODING AND PLAYBACK OF A SOUND 
SIGNAL IN AN ASYNCHRONOUS TRANSMISSION SYSTEM 

A method of managing the decoding and playback of a 
sound signal in an asynchronous transmission system, in 
which method any overabundance of the filling of a first 
buffer memory and/or of a second buffer memory at the 
inlet or the outlet of a decoding block is detected by 
comparing the filling level with at least one threshold, 
the method being characterized in that, depending on the 
value of the filling level: 

- voice activity detection is implemented and frames 
considered by said detection as being non-active are 
eliminated; and 

- concatenation processing is implemented on pairs 
of successive frames. 

A METHOD OF MANAGING THE DECODING AND PLAYBACK OF A SOUND 
SIGNAL IN AN ASYNCHRONOUS TRANSMISSION SYSTEM 

The present invention relates to a method of 
managing asynchrony in audio transmission. 

GENERAL DESCRIPTION OF THE FIELD OF THE INVENTION 

In general, the invention relates to transmission 
systems using low data rate speech encoders, in which the 
signals do not carry the reference clock of the source 
encoding system (the sampling frequency of the encoder). 
This applies, for example, to Internet protocol (IP) type 
transmissions or indeed to discontinuous transmissions, 
etc. 

A general aim of the invention is to resolve the 
problems encountered by such systems in producing a 
continuous stream of decoded speech or sound. 

Traditionally, telephone communications and sound 
channel networks have used analog transmission systems 
with frequency division multiplexing (primary groups, 
amplitude and frequency modulation). Under such 
conditions, the speech signal (or music signal; the term 
"speech" is used below throughout this document in 
generic manner) is converted into an electrical signal by 
a microphone and it is this analog signal which is 
filtered and modulated in order to be presented to a 
receiver which amplifies it prior to presenting it to a 
playback system (earphone, loudspeaker, etc.). 

Over the last few years, digital transmission and 
switching techniques have been progressively replacing 
analog techniques. In pulse code modulation (PCM) 
systems, the speech signal is sampled and converted into 
a digital signal using an analog-to-digital converter 
(ADC) driven at a fixed sampling frequency derived from a 
master clock delivered by the network and also known to 
the receiver system. This applies to analog and digital 
subscriber connection units in telecommunications 
networks. The digital signal received by the destination 
(in the broad sense) is converted back into analog so 
that it can be heard by means of a digital-to-analog 
converter (DAC) driven by a clock at the same frequency 
as that used by the ADC of the source. Under such 
conditions, the entire system is entirely synchronous as 
generally applies to present-day switching and 
transmission systems. These can include data rate 
reduction systems (for example for a telephone signal, 
for converting from 64 kilobits per second (kbit/s) to 
32 kbit/s or 16 kbit/s or 8 kbit/s). It is the network 
(or terminal systems as in the case of the integrated 
services digital network (ISDN) for example) which 
undertakes the operations of ADC, of encoding and 
decoding (where encoding and decoding are used in the 
context of reducing data rate), and of DAC. The clocks 
are always distributed and the system comprising ADC, 
speech encoding, transmission and switching, speech 
decoding, and finally DAC is fully isochronous. There 
are no losses or repeats of speech samples in the 
decoder. 

The above-described synchronous transmission systems 
require the presence of a reference clock throughout the 
network. Transmission systems are now making greater and 
greater use (initially for data) of asynchronous and 
packet techniques (IP protocol, asynchronous transfer 
mode (ATM)). In numerous new situations, the decoder has 
no reference concerning the sampling frequency used by 
the encoder and it must be capable, using its own means, 
of reconstituting a decoding clock which attempts to 
track the reference of the encoder. The present 
invention is thus particularly advantageous in frame 
relay telephone systems, in ATM telephony, or in IP 
telephony. The technique described can easily be used in 
other fields of speech or sound transmission in which 
there exists no effective transmission of the clock 
reference from the encoder to the decoder. 

DESCRIPTION OF THE STATE OF THE ART 

The general problem 

The general problem posed by transmission systems to 
which the invention applies is that of mitigating the 
fact that the speech or sound decoder has no clock 
reference associated with the source encoding. 

In this respect, two circumstances can be 
distinguished: those corresponding to "weak" asynchrony 
and those corresponding to "strong" asynchrony. 

Weak asynchrony 

As an illustration, we consider the case of a 
transmission system comprising the following, as shown 
diagrammatically in Figure 1: 

- an encoding source 1 comprising an analog-to- 
digital converter driven by a reference clock at 
frequency $F_{ADC}$ equal to 8 kilohertz (kHz) (to provide 
numerical values in the worked examples below) and a 
speech encoder (of greater or lesser complexity and 
reducing the data rate for transmission to a greater or 
lesser extent); 

- an asynchronous transmission system (represented 
by link 2) which conveys the information produced by the 
encoding source using its own transmission clock and its 
own protocols (for example the speech encoder could 
produce a data rate of 8 kbit/s and the transmission 
system could be constituted by an RS.232 type 
asynchronous link operating at 9600 bit/s); and 

- a reception and decoding system 3 receiving the 
information conveyed over the asynchronous link (whose 
data rate must necessarily be a little greater than the 
raw encoding data rate, e.g. 9600 bit/s instead of 
8000 bit/s) having the function of producing the signal 
after decoding (decompression) and applying the signal it 
produces to a digital-to-analog converter connected to a 
transducer such as a loudspeaker, a telephone handset, a 
headset, or a sound card installed in a personal computer 
(PC). 

It will be understood that since the reception and 
decoding system 3 has no clock reference, it must 
implement a strategy in order to mitigate this asynchrony 
between the encoder and the decoder. 

Whatever the encoding technique used or the type of 
transmission which does not directly convey a clock, or 
time markers within the transmitted frame, or indications 
concerning transmission instants, the above-mentioned 
problem can be reduced (ignoring the speech encoder, the 
asynchronous transmission system, and the speech decoder) 
to a system comprising the following, as shown in 
Figure 2: 

- an analog-to-digital converter 4 for converting 
speech signals or sound from analog to digital form at a 
sampling frequency set by a local oscillator; 

- a digital-to-analog converter 5 for playing back 
the sound via a transducer suitable for the field of use 
in question and operating at a sampling frequency given 
by a local oscillator which, a priori, is at the same 
frequency but which is never at exactly the same 
frequency for reasons of acceptable manufacturing cost 
(highly stable and very accurate frequency sources do 
indeed exist, but they need to be temperature-compensated 
and they are unacceptably expensive for mass-produced 
industrial implementation); and 

- a digital register 6 into which the analog-to-digital 
converter 4 writes at its own sampling frequency ($F_{ADC}$), 
said register being read at the sampling frequency ($F_{DAC}$) 
of the playback system by the digital-to-analog converter 
(DAC). 

It will be understood that since the two clock 
frequencies ($F_{ADC}$ and $F_{DAC}$) are different, it is necessary 
from time to time for the DAC to reread the same 
information twice over (if $F_{DAC}$ is greater than $F_{ADC}$) or on 
the contrary (where $F_{DAC}$ is less than $F_{ADC}$) to allow the 
ADC to overwrite information before the DAC can read it. 

The oscillators that are commonly available in the 
trade are characterized by the accuracy with which they 
operate (within a certain temperature range). 
Oscillators that are accurate to within 50 parts per 
million (ppm) are quite commonly available and are used 
to provide numerical values for the worked examples below 
showing how frequently samples are lost or repeated when 
the sampling frequency is 8 kHz (the reader can easily 
determine that at higher sampling frequencies samples are 
skipped or repeated at a frequency which is pro rata the 
sampling frequency; the higher the sampling frequency the 
higher the frequency at which samples are skipped or 
repeated). 

Under the least favorable conditions, an ADC is
operating at 8000 x (1 + 50e-6) Hz in association with a DAC
operating at 8000 x (1 - 50e-6) Hz. In this particular
example, the skip period (period for samples being
omitted in the DAC since F_DAC is less than F_ADC) is easily
calculated by counting the number N of periods of the DAC
(where the period is longer than that of the ADC) for which
the accumulated difference between the two periods
equals one period of the DAC.

Writing the period of the DAC as P_DAC (in this case
1/(8000 x (1 - 50e-6))) and P_ADC as the period of the ADC
(in this case 1/(8000 x (1 + 50e-6))), we obtain
N x (P_DAC - P_ADC) = P_DAC, where N represents the number of
individual operations that stem from the period difference.
Writing 50e-6 = ε and applying the simplifications that are
common for small quantities, we obtain N = 1/(2ε). In
this example, that immediately gives the skip period as
being close to 1.25 seconds (s). If the accuracy of the
local oscillators is improved (e.g. by going from 50e-6
to 5e-6) then the skip period will increase (in this
case there will be one skip every 12.5 s).
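
The arithmetic above can be checked numerically. The following is a minimal sketch, assuming the worst case described in the text (ADC fast and DAC slow by the same relative tolerance); the function names are illustrative and do not come from the original description.

```python
def skip_period_exact(nominal_hz: float, tolerance: float) -> float:
    """Seconds between skipped samples, ADC fast and DAC slow by `tolerance`."""
    p_adc = 1.0 / (nominal_hz * (1.0 + tolerance))  # ADC period (shorter)
    p_dac = 1.0 / (nominal_hz * (1.0 - tolerance))  # DAC period (longer)
    n = p_dac / (p_dac - p_adc)  # N such that N x (P_DAC - P_ADC) = P_DAC
    return n * p_dac             # N DAC periods, expressed in seconds


def skip_period_approx(nominal_hz: float, tolerance: float) -> float:
    """Small-quantity approximation: N = 1/(2 x tolerance) samples."""
    return 1.0 / (2.0 * tolerance) / nominal_hz


if __name__ == "__main__":
    for tol in (50e-6, 5e-6):
        print(tol, skip_period_exact(8000.0, tol), skip_period_approx(8000.0, tol))
```

At 8 kHz the approximation gives 1.25 s for 50 ppm oscillators and 12.5 s for 5 ppm oscillators, and the exact value differs from it only marginally.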






In a complete transmission system including audio 
encoders operating on signal frames, this phenomenon of 
"slip" between two clocks will give rise to an absence of 
speech frames (no frame to be decoded in the time 
available for decoding) or to overabundance of frames 
(i.e. two frames for decoding instead of one in the 
available time). Taking the example of a speech encoder
operating on 30 millisecond (ms) frames at 8 kHz, i.e.
240 samples, in each 30 ms time slot the receiver and
more particularly the decoder expects to receive one
frame for decoding in order to ensure that playback of
the speech signal remains continuous. Unfortunately, if
F_ADC is less than F_DAC, then on the above assumptions,
there will be an absence of any frame of samples for
decoding by the sound playback system once every 240 x
1.25 = 300 s, and in the converse situation there will be
two frames instead of one (i.e. a frame to be
"eliminated") at the decoder once every 300 s. Under
such circumstances, the awkward phenomenon of samples 
being skipped or repeated becomes very disagreeable since 
an entire block of the signal is skipped or repeated, and 
this needs to be managed appropriately. 
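
The frame-level figure follows directly from the sample-level one. As a rough illustration (the function names are hypothetical), the interval between missing or surplus frames is the number of samples per frame multiplied by the interval between sample slips:

```python
def samples_per_frame(frame_ms: float, sample_rate_hz: float) -> int:
    """Number of samples carried by one frame of the given duration."""
    return round(frame_ms * sample_rate_hz / 1000.0)


def frame_slip_period(frame_ms: float, sample_rate_hz: float,
                      sample_skip_period_s: float) -> float:
    """Seconds between two missing (or surplus) frames.

    One whole frame is lost or gained once the clocks have drifted apart
    by as many samples as a frame contains.
    """
    return samples_per_frame(frame_ms, sample_rate_hz) * sample_skip_period_s


if __name__ == "__main__":
    # 30 ms frames at 8 kHz, one sample slip every 1.25 s
    print(samples_per_frame(30.0, 8000.0), frame_slip_period(30.0, 8000.0, 1.25))
```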

Strong asynchrony 

Certain types of transmission amplify this problem 
of asynchrony due to the phenomenon of "slip" between 
clocks as explained above. This is what we refer to 
herein as "strong" asynchrony. 

When transmission is imperfect, giving rise to 
samples or frames of samples being lost and also when 
transmission generates jitter on sample arrival times, 
where such jitter is associated neither with the sending 
clock nor with the receiving clock, but is associated 
with other mechanisms in the transmission system having 
their own clocks, then the receiver system can be 
confronted with an absence of several frames, or with an 
overabundance of several frames. This can apply, for
example, with IP type networks which suffer from the
phenomenon of packets being lost and from the phenomenon 
of jitter introduced during packet routing. These 
phenomena disturb the continuity of the sound playback of 
the audio signal very strongly. When packets are lost or 
when jitter delays one or more packets, the playback 
system finds itself without any sample (or frame of 
samples) to apply to the DAC for the purpose of ensuring 
continuity in audio playback. Conversely, when jitter is 
strong, the playback system can find itself with far too 
many frames or samples to be sent simultaneously to the 
DAC. When jitter is strong, sound signal frame 
transmission can take place in the form of bursts, thus 
creating phenomena both of gaps and of overabundance 
amongst sample frames. 

It will be observed that when using speech encoders
operating with a system of the voice activity
detector/discontinuous transmission/comfort noise
generation (VAD/DTX/CNG) type, a mechanism is also
introduced that is similar to the loss of a packet since
in the event of silence, the sender will cease to send
frames of samples. Ceasing to send samples can be
perceived at the receiver as being the same as the loss
of a packet or as circumstances in which the ADC clock is
slower than the DAC clock, which leads to holes in the
signal at the receiver, as shown above.

"Strong" asynchrony thus differs from "weak" 
asynchrony by involving not only cyclical skips and/or 
repetitions, but also holes in the signal and/or 
overabundance of the signal in multiple and non-cyclic 
manner.

Description of various existing methods 

Two main methods are presently known for mitigating 
the drawbacks due to the fact that the speech or sound 
decoder has no clock reference. 

The first consists merely in proceeding as described
above in the paragraph describing "weak" asynchrony, i.e.
by skipping or repeating samples. The decoding system
produces samples at a rate that is more or less equal to
that of the encoder and it presents them to the digital-
to-analog converter at said rate (means for implementing
the above reconstruction system are known to the person
skilled in the art). In some cases, for example when
"strong" asynchrony applies with transmission being in
the form of frames, it is preferable when samples for
playing back are missing to send null sample frames to
the DAC rather than repeating a preceding frame.
Furthermore, in the converse situation, when surplus 
samples are present, they are not eliminated directly, 
but a first-in-first-out (FIFO) register of some size can 
be used to absorb jitter to some extent. If the FIFO 
register becomes too full, then that triggers partial or 
complete emptying of the FIFO, thereby giving rise to new 
skips in sound playback. 

The second method, which is more complicated and 
provides better performance, requires a loop to be 
implemented to recover a hardware clock which is servo- 
controlled by the filling level of a buffer memory for 
the signal to be decoded (or to be transmitted as in the 
ATM adaptation layer number 1 (AAL1) for example). That
method of servo-control attempts to use the clock
recovery loop to recover the sampling frequency of the
source. The filling level of the receive buffer produces
a control signal for servo-controlling a digital or
analog phase-locked loop (PLL).

The first method is extremely simple to implement 
but suffers from a major defect associated with the 
quality of the sound reproduced. A skip or elimination 
once every 1.25 seconds can be very disagreeable to 
listen to, and this can occur with "weak" asynchrony 
associated with correction at sample level. Similarly, 
for a system operating with frames of samples, the
inserted repetitions or blanks, and the discontinuities
in the signal due to frames being eliminated amplify loss 
of quality which becomes highly perceptible and very 
disturbing for the listener. 

Furthermore, the use of a FIFO memory runs the risk 
of establishing a considerable delay in transmission and 
that also harms the overall quality of a call.

The second method is much more complex to implement 
and requires a clock servo-control mechanism, and thus 
requires special hardware. However, it provides partial 
synchronization and therefore avoids problems associated 
with managing asynchrony. Nevertheless, that method 
adapts poorly to discontinuous transmission systems, to 
systems involving lost frames, or to systems with high
levels of jitter. Under such circumstances, 
synchronization information is no longer available. 
Furthermore, that method cannot be envisaged on terminal 
platforms where clock servo-control is not possible, as 
is the case in particular with PC type terminals, for 
example, where the system used for playing back sound is 
a sound card. 

SUMMARY OF THE INVENTION 

A general object of the invention is to propose a 
solution to the problems associated with continuity in 
the playback of a speech signal in the presence of 
asynchronous transmission, and to do so by taking action 
at receiver level, i.e. at the end of the transmission 
system. 

To this end, the invention provides a method of 
managing the decoding and playback of a sound signal in 
an asynchronous transmission system, in which method any 
overabundance of the filling of said buffer memory and/or 
of a second buffer memory at the inlet or the outlet of 
the decoding block is detected by comparing the filling 
level with at least one threshold, the method being
characterized in that, depending on the value of the
filling level:

- voice activity detection is implemented and frames
considered by said detection as being non-active are
eliminated; and

- concatenation processing is implemented on pairs
of successive frames.

Such a method is simple to implement and provides a 
guarantee of quality by avoiding excessive increase in 
transmission delay and by managing holes in the speech 
signal effectively. Furthermore, it does not imply any 
specific hardware servo-control circuit, and can 
therefore be quickly adapted to different asynchronous 
networks, terminals, and platforms. 

The method is advantageously associated with the 
various characteristics below taken singly or in any 
technically feasible combination: 

- voice activity detection is implemented and frames 
considered by said detection as being not active are 
eliminated whenever the filling level lies between a 
first threshold and a second threshold, and in that 
concatenation processing is implemented on two successive 
frames whenever the filling level lies between a second 
threshold and a third threshold; 

- the first and second thresholds are the same; 

- detection is performed at the inlet or the outlet 
of a decoding block having a first buffer memory at its 
inlet and/or its outlet to determine whether any frame is 
missing or erroneous or whether any samples to be played 
back are absent, and a fake frame is generated to ensure 
continuity in the audio playback on detecting such a 
missing or erroneous frame, or on detecting such an 
absence of samples for playback; 

- when the decoding block implements its decoding 
processing in cyclical manner relative to the content of 
the first buffer memory, detection of any missing or 
erroneous frame or of any absence of samples to play back
is implemented at the same cyclical frequency, said
detection taking place far enough in advance relative to 
the decoding process to make it possible to generate a 
fake frame in good time; 

- a fake frame is not generated when a missing or 
erroneous frame is detected for a frame on which an 
absence of samples has already been detected; 

- for a system of the type which can voluntarily 
stop sending frames, the type of the previously-generated 
frame is stored from one frame to the next, and this 
information is used to determine whether to generate fake 
frames or to generate frames of silence; 

- in processing for concatenating two successive 
frames, the samples are weighted in such a manner as to 
give more importance to the first samples of the first 
frame and to the last samples of the second frame; 

- the threshold(s) is/are adaptive; and

- a threshold is adapted as a function of the length 
of time passed with a filling level above a given 
threshold. 

The invention also provides a device for playing 
back a speech signal, the device comprising a first 
buffer memory receiving coded frames, means implementing 
decoding processing on the frames stored in said first 
buffer memory, a second buffer memory receiving decoded 
frames output by the decoding means, and sound playback 
means receiving the frames output by the second buffer 
memory, the device being characterized in that it further 
comprises means for implementing the above-specified
method.

As will be understood on reading the following 
description, these means are essentially computer means. 

DESCRIPTION OF THE FIGURES 

Other characteristics and advantages of the 
invention appear further from the following description 
which is purely illustrative and non-limiting and which 
should be read with reference to the accompanying
figures, in which:

- Figure 1 is a block diagram of an asynchronous
transmission system;

- Figure 2 is a diagram showing a model of such a 
transmission system; 

- Figure 3 is a diagram of a receiver device; and 

- Figure 4 shows the signals obtained by 
implementing concatenation processing as proposed by the 
invention. 

DETAILED DESCRIPTION OF ONE OR MORE EMBODIMENTS 

The method proposed by the invention for managing 
asynchrony in transmission implements two kinds of 
processing corresponding to handling the two phenomena 
described above, namely the lack of samples and surplus 
samples.

1. Description of the sound playback system in a
conventional transmission application 

As shown in Figure 3, the playback system for a 
speech signal comprises three elements: 

- A block 10 waiting to receive samples or frames of 
code coming from the network. The block 10 contains a 
FIFO type memory 11 or circular type buffer memory 
(referred to as "FIFO 1" in the description below) 
enabling frames to be stored on a temporary basis prior 
to being decoded. 

- A decoding block 12 which takes the frames coming 
from the block 10, decodes them, and stores them in turn 
in a FIFO memory 13 (referred to below as "FIFO 2").

- A playback block 14 which takes the decoded sample
frames and applies them to any kind of sound playback
system 15.

Depending on the terminals and the way the system is
organized, the clock frequency used for sound playback
(i.e. the digital-to-analog converter frequency F_DAC) is
not necessarily directly associated with all of the
blocks. Since the block 14 is directly associated with
the playback system, it is directly associated with the
frequency F_DAC. However the other blocks can be
associated instead with the rate at which frames arrive
from the network rather than with the frequency F_DAC.
Taking the example of a terminal provided with a
multitasking system, and in which each block is performed
by a specific task, the tasks 10 and 12 can thus be
associated with frame reception. The task 10 waits for a
frame from a network, which frame is then decoded by the
task 12 and placed in the memory FIFO 2.

Meanwhile the task 14 clocked at F_DAC takes samples
from the memory FIFO 2 and delivers them continuously to 
the sound playback system. 

It can thus be seen that regardless of whether the 
asynchrony is "strong" or "weak", it is the way in which 
the memory FIFO 2 is managed that requires particular 
attention. Similarly, if the task 12 were strongly 
associated with the task 14, then particular attention 
would be required by the memory FIFO 1. 

The mechanism constituting an implementation of the 
invention is described below in application to managing 
the memory FIFO 2, but the description includes 
explanations about how to transpose it with certain 
adaptations to managing the memory FIFO 1. 

2. Absence of samples

In order to continue playing back sound in the 
absence of samples, both potential causes of samples for 
playback being absent are treated. The first cause 
corresponds to information contained in lost packets, 
while the second cause corresponds to the absence of any 
samples to play back (e.g. FIFO 2 empty) even though it 
is still necessary to keep on sending samples to the 
sound playback system. 

2.1 Loss of frames or erroneous frames

The processing applied to lost frames or to 
erroneous frames requires a transmission system to be 
available that gives access to information about frames 
being lost and about erroneous frames being received. 
This is often the case in transmission systems. 

For example, in IP networks, it is possible to use
the marking of packets coming from the real-time transport
protocol (RTP) layer, which marking gives the exact number
of samples lost between two packets of audio code being
received. This information about loss of frames, or in
the case of IP about loss of packets (each containing one
or more speech frames), generally becomes available only
once the packet following the lost packet(s) is itself
received.

It is not necessarily advantageous to take action, 
while one or more valid frames can be decoded. With new 
generation speech encoders (CELP encoders, transform 
encoders, ...), in order to ensure that the quality of 
sound playback is maintained, it is often necessary to 
ensure a degree of synchronism between the encoder and 
the decoder. The loss of this encoder/decoder 
synchronism can be compensated by using frame loss 
correction algorithms associated with the speech encoder 
used. By way of example, these algorithms are provided 
in the standards for certain speech encoders (e.g. 
International Telecommunication Union (ITU) standard
G.723.1). When using simpler encoders, such a mechanism
is not always necessary. 

When a large number of frames have been lost, the
number of "fake" sample frames that need to be generated
in order to pack out the memory FIFO 2 can be limited.
The purpose of fake frame generation processing is to
fill holes in such a manner as to ensure signal
continuity while also smoothing the internal variables of
the decoder so as to avoid excessive divergence on
decoding the first valid frame following the invalid or
lost frames, thereby avoiding any audible discontinuity.
After a few frames have been generated, it can be assumed
that the variables have been smoothed, and thus that the
generation of such fake frames can be limited to a small
number of frames (e.g. four to six) whenever a large
number of frames has been lost.

It will thus be understood that processing is servo-
controlled in this way relative to information about lost
frames.

Similar processing is implemented on the basis of 
information about invalid frames. This information is 
forwarded to the decoder by the network portion of the 
receiver and it arrives soon enough to enable a frame 
correction algorithm to be implemented which, by taking 
account of such a non-valid frame, makes it possible to 
ensure continuity in the signal, and thus to avoid having 
another cause for samples being absent in the memory 
FIFO 2.

To sum up, this first process corresponds to 
managing information of the type "n frames lost" or 
"invalid frame received" coming from the network layer of 
the receiver. This management is characterized by 
implementing an algorithm for correcting frame losses 
(also referred to in this document as an algorithm for 
generating "fake" frames). This first process therefore
acts at decoding task level and feeds the memory FIFO 2. 

2.2 Absence of samples to be played back 

This second process is associated with the clock
coming from the task 14, i.e. with the clock at the
frequency F_DAC. As mentioned above, the memory FIFO 2 (or
FIFO 1 if the task 12 is included in the task 14) can
become empty of samples even though it is still necessary
to supply samples to the sound playback system. It is
then necessary to supply the playback system with samples
and if possible to avoid playing back zeros (since this
degrades the sound signal very greatly).



This second process can be thought of as a feedback 
loop on frame decoding. This loop causes the algorithm 
for correcting frame losses to be called and as a result 
it needs to be activated soon enough to enable the 
algorithm to be executed and to enable samples to be sent 
to the sound playback system. Depending on the platform, 
this feedback can be called in different ways. 

This loop can be implemented in two ways which are 
described below. 

For a single-task receiver (e.g. a digital signal
processor (DSP) without any real time operating system
(RTOS)), the audio decoder portion is tied completely to
the DAC clock (F_DAC) and is therefore permanently waiting
for a frame to be decoded in cyclical manner. For
example, with a speech encoder using 30 ms frames,
waiting loops are built up in periods that are multiples
of 30 ms.

Thus, for a 30 ms loop, the decoder will, every
30 ms, be expecting a frame for decoding to be placed in
the memory FIFO 1 (which can correspond merely to a frame
passing from the network layer to the task 12). On
arrival of the frame, it is decoded and placed in the
memory FIFO 2 for sending to the DAC. The feedback
processing is implemented whenever it is observed that
there is no frame for decoding in the memory FIFO 1 at
the time given by:

T = To + 30 ms - Tc 

where:

To = the start time of the 30 ms wait loop; and
Tc = the time required for executing the algorithm
for generating fake frames with a safety margin
corresponding to interrupts and/or other auxiliary
processing that might take place before the end of the
loop.

Processing is thus implemented with a latency time
deadline of Tb (loop time) - Tc (computation time +
margin).
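
The deadline test for the single-task receiver can be sketched as follows. This is only an illustration under stated assumptions: times are expressed in milliseconds, and the function and parameter names are hypothetical rather than taken from the described device.

```python
def fake_frame_deadline(t0_ms: float, loop_ms: float, tc_ms: float) -> float:
    """Latest instant T = To + loop time - Tc at which a frame may still arrive."""
    return t0_ms + loop_ms - tc_ms


def must_generate_fake_frame(now_ms: float, t0_ms: float, loop_ms: float,
                             tc_ms: float, fifo1_has_frame: bool) -> bool:
    """True when the deadline has passed and FIFO 1 holds nothing to decode."""
    deadline = fake_frame_deadline(t0_ms, loop_ms, tc_ms)
    return (not fifo1_has_frame) and now_ms >= deadline


if __name__ == "__main__":
    # 30 ms loop starting at t = 0, with 5 ms reserved for concealment + margin
    print(fake_frame_deadline(0.0, 30.0, 5.0))
    print(must_generate_fake_frame(26.0, 0.0, 30.0, 5.0, fifo1_has_frame=False))
```

The value chosen for Tc (5 ms here) is purely illustrative; in practice it depends on the platform and on the cost of the frame loss correction algorithm.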

With a multitasking receiver (e.g. a PC terminal),
time is not managed with such precision and the 
processing implemented must therefore be somewhat 
different. (Note: this processing nevertheless remains 
quite close to the preceding process since it too seeks 
to take account of the time Tc.) 

Under such circumstances, the only waiting loops 
available are often those associated with events, e.g. 
the fact that packets have been received by the network, 
or the fact that buffer memory n (containing one or more 
sample frames) sent previously to the sound playback 
system has been read by the DAC and is therefore again 
available for sending samples to the DAC. 

Depending on the structure of the system and on 
whether or not it is necessary to respond quickly to an 
event, it is possible to wait for a certain length of 
time before filling said buffer memory prior to 
forwarding to the DAC. Such a latency time is selected 
in such a manner as to leave enough time for the 
algorithm for generating "fake" frames to execute, if 
necessary. 

Then, possibly after said latency time has elapsed, 
the process verifies that sufficient samples are present 
in FIFO 2 (note: this could apply to FIFO 1 if the 
management takes place at its level), and if not it
requests an appropriate number of fake frames to be 
generated in order to fill buffer memory n. 
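
One possible way of computing the "appropriate number" of fake frames is sketched below. It assumes that buffer memory n must be filled completely and that fake frames have the same length as ordinary frames; both points, like the function name, are assumptions made for the example.

```python
import math


def fake_frames_needed(available_samples: int, buffer_samples: int,
                       frame_samples: int) -> int:
    """Number of fake frames to request so that buffer memory n can be filled.

    available_samples: samples currently waiting in FIFO 2
    buffer_samples:    capacity of buffer memory n, in samples
    frame_samples:     samples produced by one fake frame
    """
    missing = max(0, buffer_samples - available_samples)
    return math.ceil(missing / frame_samples)


if __name__ == "__main__":
    # Buffer of two 240-sample frames, FIFO 2 holding only 100 samples
    print(fake_frames_needed(100, 480, 240))
```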

When the system is such that it is necessary to fill 
buffer memory n "immediately", then monitoring the 
availability of samples and possibly calling for the 
"fake" frames generation processing are implemented 
directly after each delivery to the DAC from the buffer 
memory so that the generated samples are already in the 
memory FIFO 2 when the event "buffer memory n available" 
occurs.

Thus, whatever the receiver, the process observes 
the absence of samples to be sent to the sound playback 
system by implementing a check on the content of the
buffer memory FIFO 2 (or FIFO 1 depending on how the 
sound playback system is managed) and activates the 
algorithm for generating "fake" frames in order to 
generate the missing samples. 

It will be understood that the second process
responds firstly to the problem of "slip" between clocks,
and more precisely to the circumstance in which the
receive clock (F_DAC) is faster than the send clock (F_ADC).
It also applies to the phenomenon of frames being lost 
since this can lead to there being an absence of samples 
to send to the DAC even before frame loss has been 
detected, since such detection occurs only on receiving 
the frame following the loss. 

In order to combine the actions of the first and 
second processes, the first process is prevented from 
generating "fake" frames on detection of frame loss 
whenever the corresponding frames have just been 
generated by the second process. 

For this purpose, use is made of flags and also of 
counters determining the number of samples that have been 
generated by the second process. 
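
A minimal sketch of such bookkeeping is given below. For simplicity it counts whole frames rather than samples, which is an assumption of the example; the class and method names are likewise hypothetical.

```python
class ConcealmentLedger:
    """Tracks fake frames already produced by the second process."""

    def __init__(self) -> None:
        self.pre_generated = 0  # fake frames produced on underrun, not yet matched

    def record_underrun_fill(self, n_frames: int) -> None:
        """Second process: n_frames fake frames were generated on an empty FIFO."""
        self.pre_generated += n_frames

    def frames_to_conceal(self, n_lost: int) -> int:
        """First process: how many fake frames remain to be generated for a
        reported loss of n_lost frames, after discounting those already
        generated by the second process."""
        already_done = min(self.pre_generated, n_lost)
        self.pre_generated -= already_done
        return n_lost - already_done


if __name__ == "__main__":
    ledger = ConcealmentLedger()
    ledger.record_underrun_fill(2)      # FIFO 2 ran dry twice
    print(ledger.frames_to_conceal(3))  # network later reports 3 lost frames
```

In this sketch the counter plays the role of both the flag (non-zero means the second process has acted) and the counter mentioned above.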

2.3 Specific actions for speech encoders using 
VAD/DTX/CNG services 

Encoders using a VAD/DTX/CNG system can voluntarily 
stop sending frames; under such circumstances, the 
absence of samples must not be considered exactly as a 
loss of frames, but rather as a period of silence. The 
only way of determining whether the frame to be generated 
must be silence or should correspond to a lost frame is 
to know the type of the previously-generated frame (i.e. 
signal frame or frame corresponding to a lost frame, or a 
noise update frame (SID), or a frame of silence (NOT)).
For this purpose, the type of the generated frame is
stored, and while frames are being generated to
compensate for an absent frame or a lost frame, it is
decided whether fake frames should be generated using the
algorithm for correcting frame losses (as applies when
the preceding frame was of the FSF type), or whether
frames of silence should be generated by activating the
decoder appropriately (as applies when the preceding
frame was of the SID or the NOT type).
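
The decision can be sketched as a simple lookup on the stored frame type. The enumeration below is an assumption made for illustration: it treats both ordinary signal frames and previously generated fake frames as calling for frame loss correction, and SID or NOT frames as calling for silence.

```python
from enum import Enum


class FrameType(Enum):
    SIGNAL = "signal"        # ordinary decoded speech frame
    CONCEALED = "concealed"  # fake frame standing in for a lost frame
    SID = "sid"              # noise update frame
    NOT = "not"              # frame of silence (nothing transmitted)


def choose_fill(previous: FrameType) -> str:
    """Decide how to fill a gap, given the type of the previously generated frame."""
    if previous in (FrameType.SID, FrameType.NOT):
        return "silence"     # sender stopped on purpose: keep generating silence
    return "fake_frame"      # speech was in progress: run frame loss correction


if __name__ == "__main__":
    for frame_type in FrameType:
        print(frame_type.name, "->", choose_fill(frame_type))
```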

3. Overabundance of samples to be played back

In order to deal with an overabundance of samples to 
be played back, processing is implemented to empty out 
frames, eliminating certain frames in full or in part 
prior to their possibly being taken into account by the 
sound playback system. 

This processing enables frames to be stored in 
buffer memories until certain thresholds trigger actions 
for limiting the amount of frames in memory and for 
limiting any corresponding increase in delay across the 
communications system. This limited storage makes it 
possible to accommodate jitter phenomena on receiving 
frames in bursts and also slip between clocks, while 
nevertheless limiting transmission delay. 

3.1 Emptying out processing 

Any accumulation of frames is initially detectable 
in the memory FIFO 1, and is subsequently transferred to 
the memory FIFO 2. 

The proposed method manages information concerning 
the filling level of a reference buffer memory, i.e. 
FIFO 1 or FIFO 2 depending on how the tasks 10, 12, and 
14 are organized in the receiver (as explained above).
If the tasks 12 and 14 are associated with each other, 
then the filling level information used by the method 
relates to the memory FIFO 1 which acts as a buffer 
between the network and the sound playback system. 
Similarly, if the tasks 10 and 12 are associated, then it 
is the memory FIFO 2 which acts as a buffer and it is
therefore its filling level which is taken into
consideration by the management process.

The process is explained below for the second
context. The first is merely an immediate transposition
thereof.

In order to maintain synchronization as closely as 
possible between the encoder and the decoder, and thus 
maintain optimum sound playback, all of the frames coming 
from the network are decoded. The process then decides 
on what action to take on the decoded frame as a function 
of information concerning filling level. This action is 
described in greater detail below. To activate the 
processing, filling level thresholds are used. These 
thresholds define filling alarm levels for the FIFO 
memory. In order to take action that is as inaudible as 
possible (i.e. in order to limit quality degradation) two 
levels of action are selected. A first level (alarm
level 1) corresponds to the filling level being excessive
but not critical (far from the maximum acceptable filling
level), while a second level (alarm level 2) corresponds
to it being mandatory to take action on each frame (this
level is quite close to the maximum acceptable level). A
third or "emergency" level (alarm level 3) is also
defined in order to avoid memory overflows or other
problems. This level corresponds to filling being very
close to the maximum acceptable level. Alarm level 3
should never be reached if the actions taken at the two 
preceding threshold levels are properly performed and if 
the thresholds are properly defined. 

Each time decoding is performed, the information 
concerning filling level is compared with the thresholds 
in order to determine the state of the FIFO (in an alarm 
state or not), and, where appropriate, the level of the 
alarm. 
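
The comparison with the thresholds can be sketched as follows. The threshold values in the usage example and the choice of inclusive comparisons are assumptions made for illustration; the text does not specify them.

```python
def alarm_level(fill: int, t1: int, t2: int, t3: int) -> int:
    """Return 0 (no alarm) or the alarm level 1, 2 or 3 for a filling level.

    t1, t2, t3 are the thresholds of alarm levels 1, 2 and 3, expressed in
    the same unit as `fill` (frames or samples).
    """
    if not t1 <= t2 <= t3:
        raise ValueError("thresholds must satisfy t1 <= t2 <= t3")
    if fill >= t3:
        return 3
    if fill >= t2:
        return 2
    if fill >= t1:
        return 1
    return 0


if __name__ == "__main__":
    # Hypothetical thresholds, in frames, for a FIFO of 12 frames
    for fill in (2, 5, 9, 11):
        print(fill, alarm_level(fill, t1=4, t2=8, t3=11))
```

Setting t1 equal to t2 gives the variant mentioned earlier in which the first and second thresholds are the same.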

If the state obtained is not an alarm state, then no 
action is undertaken and the decoded frame is stored in 
FIFO 2. 



In the first alarm state, it is considered that at 
least 50% of the signal coming from a conversation is not 
useful and therefore at this alarm level, all frames 
presenting very little information are eliminated. For 
this purpose, it is possible to implement a simple VAD
which monitors all frames of samples after they have been
decoded to decide whether or not they should be written
into FIFO 2. The process can also make decisions based
on information taken directly from the coded frame
concerning the importance or otherwise of the information 
contained in the frame. In this alarm state, any frame 
that is considered as containing nothing but noise is 
simply not stored in FIFO 2 for future sound playback. 

In the second alarm state (critical level), it is
necessary to take action on each frame to curb growth in
the filling level of the memory FIFO 2 very aggressively.
At this level, the preceding processing (i.e. the
processing implemented for alarm level 1) remains active.
However it is now also necessary to shorten pairs of
consecutive frames down to the length of one frame or
shorter. A decision is therefore taken on the basis of
two non-"silent" sample frames (given that any frame that
is "silent" is simply not written to FIFO 2 as a result
of alarm state 1 being already included in alarm state
2). Action on two consecutive frames is therefore
undertaken only when a frame is detected as being non-
"silent". The frame is initially stored, and then if the
next frame is "silent", then it is only the first frame
that is written into FIFO 2.

When both frames contain important information, it 
becomes necessary to replace them by a single frame while 
minimizing loss of information and degradation of 
quality. It is the replacement frame that is stored in 
FIFO 2. Any effective solution capable of performing
this task can be used and activated under such conditions
(i.e. second alarm state and two non-"silent" frames).
Two examples of algorithms for performing this task are
described below.

In a first algorithmic solution, the two contiguous
frames are replaced by a single frame in which each
coefficient y_i (where i lies in the range 0 to N-1 and
where N is the number of samples per frame) is given the
value (x_{2i} + x_{2i+1})/2 (where the coefficients x_j, with
j lying in the range 0 to 2N-1, come from both original
frames). This solution amounts to performing a kind of
smoothed undersampling. The frequency of the played-back
signal is thus doubled for this frame. Nevertheless, the
inventors have found that provided alarm state 2 does
not occur very frequently, this solution suffices to
maintain the quality of sound playback.
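
Reading this first solution as a pairwise average of adjacent samples taken across the two concatenated frames, it can be sketched as follows. The function name is illustrative, and the interpretation of the indices is an assumption of the example.

```python
from typing import List, Sequence


def merge_frames_by_averaging(first: Sequence[float],
                              second: Sequence[float]) -> List[float]:
    """Replace two N-sample frames by one N-sample frame.

    Output sample i is the mean of samples 2i and 2i+1 of the 2N-sample
    sequence formed by the two frames placed end to end.
    """
    if len(first) != len(second):
        raise ValueError("both frames must have the same length")
    x = list(first) + list(second)  # the 2N original samples
    n = len(first)
    return [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(n)]


if __name__ == "__main__":
    print(merge_frames_by_averaging([0, 2, 4, 6], [8, 10, 12, 14]))
```

Since the output covers twice the signal duration in the time of one frame, the played-back content is heard at twice its original frequency, as noted above.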

In a second solution, signal amplitude is detected 
to enable the two frames to be compacted into a pseudo- 
frame of length shorter than or equal to that of one 
frame. The number of samples contained in the pseudo- 
frame is determined by the fundamental frequency or 
"pitch" information, but in all events it is shorter than 
or equal to the length of a normal frame, while
nevertheless being close to said frame length. The 
algorithm used ensures continuity of the played-back 
signal without any audible hole, and without any 
frequency doubling, while nevertheless dividing the 
amount of storage required for the signal by a factor 
that is greater than or equal to 2. This is explained in 
greater detail in paragraph 3.4 below. Furthermore, this 
also minimizes the loss of sound information since less 
than 50% of the information is in fact eliminated. 

It should be observed that when the receiver
implements its processing on the basis of analyzing
FIFO 1, with the decoder being directly associated with
the sound playback system, it is necessary to generate a
number of samples that is sufficient, i.e. in the present
case a number that ensures at least one frame of samples
is made available in FIFO 2. The frame concatenation
algorithm is then calibrated to ensure that it always
generates a minimum number of samples, namely at least one
frame. Another solution would consist in activating the
algorithm several times over instead of only once, when
that is necessary to obtain a sufficient number of
samples.

In the third alarm level, which is normally never 
reached, no frames are stored in FIFO 2. In a variant, 
the system can also decide to clear out a fraction of the 
buffer memory suddenly (this can apply where it is 
management of FIFO 1 that is activated).

It should also be observed that depending on the 
network and on the types of problems it generates in 
terms of asynchrony, it is possible to decide whether or 
not to activate particular alarm levels. For example, 
when asynchrony is "weak" then alarm levels 1 and 2 can 
be combined, and the simple solution of replacing two 
frames by a single frame can then constitute the only 
active process. 

3.2 Alarm thresholds

There follows a more detailed description of the 
alarm thresholds and how they are managed. 

As explained above, the reference memory is said to 
be in alarm state 1 when its filling level is above 
threshold 1; this state remains active until its filling 
level drops below threshold 0. State 1 therefore 
operates with hysteresis. 

The memory is said to be in alarm state 2 if the 
filling level exceeds threshold 2 and to be in alarm 
state 3 if the filling level exceeds threshold 3. It is 
possible to envisage managing these alarm states with 
hysteresis as well. 
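
A minimal sketch of this threshold logic is given below, assuming that the filling level and the thresholds are expressed in the same unit (for instance a number of frames). The class name `AlarmMonitor` and its fields are hypothetical; only alarm state 1 is given hysteresis here, as in the description, while states 2 and 3 are plain comparisons.

```python
from dataclasses import dataclass


@dataclass
class AlarmMonitor:
    threshold0: int  # alarm state 1 is released below this level
    threshold1: int  # alarm state 1 is entered above this level
    threshold2: int  # alarm state 2 applies above this level
    threshold3: int  # alarm state 3 applies above this level
    in_alarm1: bool = False

    def update(self, fill_level: int) -> int:
        """Return the highest active alarm level (0 means no alarm)."""
        # Hysteresis on state 1: set above threshold 1, cleared below threshold 0.
        if fill_level > self.threshold1:
            self.in_alarm1 = True
        elif fill_level < self.threshold0:
            self.in_alarm1 = False
        if fill_level > self.threshold3:
            return 3
        if fill_level > self.threshold2:
            return 2
        return 1 if self.in_alarm1 else 0
```

Because of the hysteresis, a filling level that falls back between threshold 0 and threshold 1 after an excursion above threshold 1 is still reported as alarm state 1.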

Thresholds 0, 1, and 2 are adaptive. Threshold 3 is
directly associated with the maximum acceptable size and
it is fixed. These thresholds need to be adaptable in
order to accommodate different call contexts and real
time fluctuations during a call. It is appropriate to
allow a greater amount of delay when the call is
suffering a large amount of jitter (where delaying
playback remains the best way of ensuring acceptable
quality in the presence of jitter). In a high-jitter
context, it is therefore appropriate for the thresholds
0, 1, and 2 to be at quite high levels.

To facilitate processing, the positions of the
thresholds can correspond to integer numbers of frame
sizes as exchanged between the various tasks of the
receiver. This frame size is written Tt.

By way of example, the initial values of these
thresholds can be as follows:

Threshold 0: 5 x Tt
Threshold 1: 8 x Tt
Threshold 2: 12 x Tt
Threshold 3: 24 x Tt (fixed value)

The thresholds 0, 1, and 2 can be adapted together
in steps of size Tt. Extreme acceptable values can lie
in the range -1 to +8 steps relative to the initial
values, for example. Thus, threshold 1 can take on the
values 7x, 8x, 9x, 10x, ..., 16x Tt. Threshold adaptation
proper is performed on the basis of an adaptation
criterion which is the length of time spent in the alarm
state. For this purpose, an alarm state percentage is
evaluated about once every N seconds (e.g. N = 10). When
this percentage exceeds a given threshold (5%), the alarm
state thresholds are increased; otherwise, when the
percentage is below a given minimum threshold (0.5%), the
alarm thresholds are decreased. To ensure that the system
does not oscillate excessively due to its thresholds
being adapted too frequently, hysteresis is applied to
adaptation decision making. The thresholds are actually
increased by one step only in the presence of two
consecutive increase options, and they are decreased by
one step only in the presence of three consecutive
decrease options. As a result, the length of time between
two threshold increments is at least 2N seconds and the
length of time between two threshold decrements is at
least 3N seconds. The procedure for increasing
thresholds can be accelerated if a large percentage of
frames are in an alarm state. One accelerating procedure
consists in increasing the thresholds directly, for
example whenever the alarm percentage exceeds 50%.
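
The adaptation rule can be sketched as below, using the example figures given in the text (initial thresholds 5, 8, and 12 x Tt, offsets from -1 to +8 steps, limits of 0.5%, 5%, and 50%). The class name `ThresholdAdapter` and its fields are hypothetical. Resetting both counters when the percentage lies between the two limits is an assumption made here to model the "consecutive" requirement; the text does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ThresholdAdapter:
    base: Tuple[int, int, int] = (5, 8, 12)  # thresholds 0, 1, 2 in units of Tt
    offset: int = 0            # current adaptation step
    min_offset: int = -1
    max_offset: int = 8
    raise_above: float = 5.0   # percent of time in alarm that votes for an increase
    lower_below: float = 0.5   # percent of time in alarm that votes for a decrease
    fast_above: float = 50.0   # percent that triggers an immediate increase
    up_votes: int = 0
    down_votes: int = 0

    def update(self, alarm_percent: float) -> Tuple[int, ...]:
        """Call once per evaluation period with the alarm state percentage."""
        if alarm_percent > self.fast_above:
            self._step(+1)                      # accelerated increase
            self.up_votes = self.down_votes = 0
        elif alarm_percent > self.raise_above:
            self.up_votes += 1
            self.down_votes = 0
            if self.up_votes == 2:              # two consecutive increase options
                self._step(+1)
                self.up_votes = 0
        elif alarm_percent < self.lower_below:
            self.down_votes += 1
            self.up_votes = 0
            if self.down_votes == 3:            # three consecutive decrease options
                self._step(-1)
                self.down_votes = 0
        else:
            self.up_votes = self.down_votes = 0  # assumption: streak is broken
        return self.thresholds()

    def _step(self, delta: int) -> None:
        self.offset = max(self.min_offset, min(self.max_offset, self.offset + delta))

    def thresholds(self) -> Tuple[int, ...]:
        return tuple(b + self.offset for b in self.base)
```

Threshold 3 is deliberately left out of the adapter, since the text states that it is fixed.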

Naturally, the threshold values given for the alarm 
thresholds are provided purely by way of indication. 

3.3 Interaction with the first process 

The first process is the process which causes "fake"
frames to be generated when frames are lost or erroneous.
When the system is in an alarm state (overabundance of
frames), there is no need to generate "fake" frames, which
would merely aggravate the phenomenon of overabundance.
Nevertheless, in order to maintain high quality sound
playback, it is important to maintain encoder/decoder
synchronization by informing the decoder whenever a frame
has been lost (e.g. by launching the generation of one or
two fake frames, but no more). In the alarm state, the
third process therefore acts on the first process so as
to curb very strongly the generation of "fake" frames.
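
One way of expressing this curb is a simple cap on the number of fake frames produced for a run of lost frames. The sketch below is an assumption-laden illustration: the function name `fake_frames_to_generate`, the default cap of two frames (taken from the "one or two" example), and the choice of generating one fake frame per lost frame outside the alarm state are hypothetical.

```python
def fake_frames_to_generate(lost_frames: int, in_alarm: bool, alarm_cap: int = 2) -> int:
    """Number of fake frames the first process is allowed to generate.

    Outside the alarm state, one fake frame per lost frame is assumed.
    In the alarm state, generation is capped so that the decoder is still
    informed of the loss without aggravating the overabundance.
    """
    if lost_frames < 0:
        raise ValueError("lost_frames must be non-negative")
    if in_alarm:
        return min(lost_frames, alarm_cap)
    return lost_frames
```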

3.4 Frame concatenation

The purpose of the concatenation process is to
shorten the duration of a digital audio signal containing
speech or music while introducing as little audible
degradation as possible. Since the sampling frequency is
given and fixed, it is the number of samples sent to the
sound playback apparatus that is decreased. One obvious
solution for shortening a sequence of N samples is to
remove M regularly spaced apart samples from the sequence
in question. This causes the fundamental frequency to
increase and that can be unpleasant for the listener,
particularly when the ratio M/N is large. Furthermore,
there is a danger of no longer complying with the
sampling theorem. The process described below makes it
possible to shorten an audio sequence without modifying 
its fundamental frequency and without giving rise to 
audible degradation due to signal discontinuity. This
process is based on detecting the value of the pitch
period. The number of samples eliminated by this
algorithm cannot be selected freely, since it is a
multiple of the pitch value P. Nevertheless, it is
possible to define a minimum number of samples to be
eliminated, $N_{e,\min}$, which must satisfy the relationship
$N_{e,\min} \le N/2$. Since the purpose is to eliminate at
least 50% of the samples in the context of the device for
managing asynchrony in an audio transmission, it is
advantageous to set $N_{e,\min} = N/2$. It is also assumed
that the maximum value of the pitch P is less than the
length N of the sequence to be shortened. The number of
samples that are eliminated by the algorithm is then the
smallest multiple of the pitch value P that is greater
than or equal to $N_{e,\min}$, i.e. $N_e = kP$ where k is a
positive integer and $N_e \ge N_{e,\min} > N_e - P$. The
length of the output signal is then $N_r = N - N_e$. The
input signal to be shortened is written s(n), where
$n = 1, \ldots, N$, and the output signal is written s'(n),
where $n = 1, \ldots, N_r$. In order to ensure continuity in
the output signal, the first and the last $N_r$ samples of
the signal s(n) are merged progressively, i.e.
$s'(n) = s(N_e + n)\,w(n) + s(n)\,(1 - w(n))$ for $n = 1, \ldots, N_r$,
where w(n) is a weighting function such that
$0 \le w(n) \le 1$ for $n = 1, \ldots, N_r$ and
$w(n) \le w(n+1)$ for $n = 1, \ldots, N_r - 1$. For example,
w(n) can merely be the linear function $w(n) = n/N_r$. For
an unvoiced signal where it is not possible to determine
the pitch, $N_e$ can be fixed freely.
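
The shortening procedure can be sketched in a few lines, assuming that the pitch period P (in samples) has already been estimated by some other means and that a linear weighting function is used. The function names `samples_to_remove` and `shorten`, and the default choice of removing at least half of the samples, are assumptions made for this illustration.

```python
import math
from typing import List, Optional, Sequence


def samples_to_remove(n: int, pitch: int, n_emin: int) -> int:
    """Smallest multiple of the pitch period that is >= n_emin."""
    if pitch <= 0:
        raise ValueError("pitch must be a positive number of samples")
    k = max(1, math.ceil(n_emin / pitch))
    n_e = k * pitch
    if n_e >= n:
        raise ValueError("sequence too short for this pitch and minimum")
    return n_e


def shorten(s: Sequence[float], pitch: int, n_emin: Optional[int] = None) -> List[float]:
    """Shorten s by a whole number of pitch periods with a linear cross-fade."""
    n = len(s)
    if n_emin is None:
        n_emin = n // 2  # assumption: eliminate at least half of the samples
    n_e = samples_to_remove(n, pitch, n_emin)
    n_r = n - n_e  # length of the output signal
    out: List[float] = []
    for i in range(n_r):  # i corresponds to n - 1 in the 1-based notation
        w = (i + 1) / n_r  # linear weight rising to 1 at the last sample
        out.append(s[n_e + i] * w + s[i] * (1.0 - w))
    return out
```

For a 640-sample sequence with a detected pitch of 47 samples, `samples_to_remove(640, 47, 320)` gives 329, so the output holds 311 samples. The beginning of the output follows the start of the input and its end coincides with the end of the input, which is what preserves continuity with the neighbouring signal.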

Figure 4 showing signal sequences A, B, C, and D
illustrates how the process is implemented on a worked
example. The first sequence (A) is shown as a solid line
and constitutes a piece of the signal s(n) to be
shortened that is N = 640 samples long. The purpose is
to shorten this sequence by at least 320 samples without
changing its fundamental frequency, and without
introducing any discontinuity or other audible
degradation. The pitch of s(n) varies slowly, its value
being equal to 49 at the beginning of the sequence and 45
at the end of the sequence. The pitch detected by a
correlation method is P = 47. Thus, s(n) will be
shortened by k = 7 periods, i.e. by
$N_e = kP = 7 \times 47 = 329$ samples.

In this example, linear weighting has been selected.
The sequences B and C show two pieces of the signal of
length $N_r = N - N_e = 311$ that have already been weighted
and that are subsequently merged together. Merging is
performed by adding these two signals together. In
sequence C, it can be seen that because of the slight
variation in pitch, these two pieces of the signal s(n)
are somewhat phase-shifted. Because of the merging
technique used, this does not give rise to any
discontinuity in the output signal s'(n) (continuous line
in sequence D). It can also be seen in sequence D that
the shortened signal s'(n) remains properly in phase with
the signals that precede it and that follow it (dashed
line in Figures 1 and 4).

CLAIMS

1/ A method of managing the decoding and playback of a
sound signal in an asynchronous transmission system, in
which method any overabundance of the filling of said
buffer memory and/or of a second buffer memory at the
inlet or the outlet of the decoding block is detected by
comparing the filling level with at least one threshold,
the method being characterized in that, depending on the
value of the filling level:
- voice activity detection is implemented and frames
considered by said detection as being non-active are
eliminated; and
- concatenation processing is implemented on pairs
of successive frames.

2/ A method according to claim 1, characterized in that
voice activity detection is implemented and frames
considered by said detection as being not active are
eliminated whenever the filling level lies between a
first threshold and a second threshold, and in that
concatenation processing is implemented on two successive
frames whenever the filling level lies between a second
threshold and a third threshold.



3/ A method according to claim 2, characterized in that
the first and second thresholds are the same. 



4/ A method according to any preceding claim,
characterized in that detection is performed at the inlet
or the outlet of a decoding block having a first buffer
memory at its inlet and/or its outlet to determine
whether any frame is missing or erroneous or whether any
samples to be played back are absent, and a fake frame is
generated to ensure continuity in the audio playback on
detecting such a missing or erroneous frame, or on
detecting such an absence of samples for playback.

5/ A method according to claim 4, characterized in that
when the decoding block implements its decoding 
processing in cyclical manner relative to the content of 
the first buffer memory, detection of any missing or 
erroneous frame or of any absence of samples to play back 
is implemented at the same cyclical frequency, said 
detection taking place far enough in advance relative to 
the decoding process to make it possible to generate a 
fake frame in good time. 

6/ A method according to claim 4 or claim 5, 
characterized in that a fake frame is not generated when 
a missing or erroneous frame is detected for a frame on 
which an absence of samples has already been detected. 



7/ A method according to any one of claims 4 to 6,
characterized in that, for a system of the type which can
voluntarily stop sending frames, the type of the
previously-generated frame is stored from one frame to
the next, and this information is used to determine
whether to generate fake frames or to generate frames of
silence.

8/ A method according to any preceding claim,
characterized in that in processing for concatenating two
successive frames, the samples are weighted in such a
manner as to give more importance to the first samples of
the first frame and to the last samples of the second
frame.

9/ A method according to any preceding claim, 
characterized in that the threshold(s) is/are adaptive.

10/ A method according to claim 9, characterized in that
a threshold is adapted as a function of the length of
time passed with a filling level above a given threshold.

11/ A device for playing back a speech signal, the device
comprising a first buffer memory receiving coded frames, 
means implementing decoding processing on the frames 
stored in said first buffer memory, a second buffer 
memory receiving decoded frames output by the decoding 
means, and sound playback means receiving the frames 
output by the second buffer memory, the device being 
characterized in that it further comprises means for 
implementing the method according to any preceding claim. 






ABSTRACT 

A METHOD OF MANAGING THE DECODING AND PLAYBACK OF A SOUND 
SIGNAL IN AN ASYNCHRONOUS TRANSMISSION SYSTEM 

A method of managing the decoding and playback of a
sound signal in an asynchronous transmission system, in
which method any overabundance of the filling of said
buffer memory and/or of a second buffer memory at the
inlet or the outlet of the decoding block is detected by
comparing the filling level with at least one threshold,
the method being characterized in that, depending on the
value of the filling level:

- voice activity detection is implemented and frames
considered by said detection as being non-active are
eliminated; and

- concatenation processing is implemented on pairs
of successive frames.






Translation of the title and the abstract as they were when originally filed by the
Applicant. No account has been taken of any changes that may have been made
subsequently by the PCT Authorities acting ex officio, e.g. under PCT Rules 37.2,
38.2, and/or 48.3.



10/019550 



[Drawing sheet 1/2]

FIG. 1: [block diagram; labels: input speech or sound signal; sound pick-up system with low data rate encoding; local sampling frequency Fadc; asynchronous link; speech or sound decoding and play-back system; local sampling frequency Fdac; output speech or sound signal]

FIG. 2: [block diagram; labels: speech signal microphone and preamp; analog-to-digital converter ADC; sampling frequency Fadc; write codes to register; register: both ADC output and DAC input; read codes in register; digital-to-analog converter DAC; output speech signal: amplifier and loud-speaker]

[Drawing sheet 2/2]

FIG. 3: [block diagram; labels: network; code frame; FIFO 1; decoder; FIFO 2; sound play-back system; reference numerals 10 to 15]

FIG. 4: [signal plots; panel labels Seq. A, Seq. B, Seq. C; horizontal axis ticks from 100 to 700]

Ref.: 1565.P387

DECLARATION AND POWER OF ATTORNEY FOR PATENT APPLICATION



As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below, next to my name.

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, first and
joint inventor (if plural names are listed below) of the subject matter which is claimed and for which a patent
is sought on the invention entitled

METHOD FOR DECODING AND RETRIEVING A SOUND SIGNAL IN AN ASYNCHRONOUS TRANSMISSION 
SYSTEM 



the specification of which

is attached hereto

was filed on June 22, 2000 as
Application Serial No. PCT/FR00/01734

and was amended on
(if applicable)

I hereby state that I have reviewed and understand the contents of the above-identified specification,
including the claims, as amended by any amendment referred to above. I do not know and do not believe
that the same was ever known or used in the United States of America before my invention thereof, or
patented or described in any printed publication in any country before my invention thereof or more than one
year prior to this application, that the same was not in public use or on sale in the United States of America
more than one year prior to this application, and that the invention has not been patented or made the
subject of an inventor's certificate issued before the date of this application in any country foreign to the
United States of America on an application filed by me or my legal representatives or assigns more than
twelve months prior to this application.

I acknowledge the duty to disclose information which is material to the examination of this application in
accordance with Title 37, Code of Federal Regulations, Section 1.56(a).



I hereby claim foreign priority benefits under Title 35, United States Code, Section 119, of any foreign
application(s) for patent or inventor's certificate listed below and have also identified below any foreign
application for patent or inventor's certificate having a filing date before that of the application on which
priority is claimed:

Prior Foreign Application(s)                          Priority Claimed

99/08081    FRANCE       24 JUNE 1999                 X
(Number)    (Country)    (Day/Month/Year Filed)       Yes  No



I hereby claim the benefit under Title 35, United States Code, Section 120 of any United States
application(s) listed below and, insofar as the subject matter of each of the claims of this application is not
disclosed in the prior United States application in the manner provided by the first paragraph of Title 35,
United States Code, Section 112, I acknowledge the duty to disclose material information as defined in Title
37, Code of Federal Regulations, Section 1.56(a) which occurred between the filing date of the prior
application and the national or PCT international filing date of this application.

PCT/FR00/01734              22 JUNE 2000     Pending
(Application Serial No.)    (Filing Date)    (Status - patented, pending, abandoned)




I hereby appoint BLAKELY, SOKOLOFF [remainder of firm name illegible], a firm including:

Keith G. Askoff, Reg. No. [illegible]; Aloysius T.C. AuYeung, Reg. No. 35,432; Bradley J. Bereznak, Reg. No.
[illegible]; Michael A. Bernadicou, Reg. No. [illegible]; Roger W. Blakely, Jr., Reg. No. [illegible]; Timothy R. Croll,
Reg. No. 36,771; Daniel M. De Vos, Reg. No. 37,813; Scott A. Griffin, Reg. No. 38,167; Stephen D. Gross,
Reg. No. [illegible]; David R. Halvorson, Reg. No. 33,395; Michael D. Hartogs, Reg. No. 36,547; Brian D.
Hickman, Reg. No. 35,894; George W. Hoover II, Reg. No. 32,992; Paul H. Hostmann, Reg. No. 36,167;
Eric S. Hyman, Reg. No. 30,139; Dag H. Johansen, Reg. No. 36,172; Stephen L. King, Reg. No. 19,180;
Joseph T. Lin, Reg. No. 38,225; Michael J. Mallie, Reg. No. 36,591; James D. McFarland, Reg. No. 32,544;
Anthony C. Murabito, Reg. No. 35,295; Kimberley G. Nobles, Reg. No. 38,255; Ronald W. Reagin, Reg. No.
20,340; Kent R. Richardson, Reg. No. P-39,443; James H. Salter, Reg. No. 35,668; William W. Schaal, Reg.
No. [illegible]; James C. Sheller, Reg. No. [illegible]; Edward W. Scott IV, Reg. No. 36,000; Maria E. Sobrino,
Reg. No. 31,639; Stanley W. Sokoloff, Reg. No. [illegible]; Allan T. Sponseller, Reg. No. 38,318; John C.
Stattler, Reg. No. 36,285; Edwin H. Taylor, Reg. No. [illegible]; Lester J. Vincent, Reg. No. 31,460; Ben J.
Yorks, Reg. No. 33,609; and Norman Zafman, Reg. No. [illegible]; my attorneys; and William D. Davis, Reg.
No. [illegible]; Gary B. Goates, Reg. No. 35,159; Soyeon P. Laub, Reg. No. [illegible]; Thomas X. Li, Reg. No.
[illegible]; and Edwin A. Sloane, Reg. No. 34,728; my patent agents, with offices located at 12400 [illegible]
Boulevard, 7th Floor, Los Angeles, [illegible], telephone (310) 207-3800, with full power of
substitution and revocation, to prosecute this application and to transact all business in the Patent and
Trademark Office connected herewith.



I hereby declare that all statements made herein of my own knowledge are true and that all statements made 
on information and belief are believed to be true; and further that these statements were made with the 
knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or both, 
under Section 1001 of Title 18 of the United States Code and that such willful false statements may 
jeopardize the validity of the application or any patent issued thereon. 



Full Name of Sole/First Inventor: DELEAM David
Inventor's Signature:

Residence: PERROS GUIREC / FRANCE
(City, State)

Date: 11 JAN. 2002

Citizenship: FR
(Country)

Post Office Address: 11, Rue du Marechal Leclerc 22700 PERROS GUIREC / FRANCE

Full Name of Second/Joint Inventor: KOVESI Balazs
Inventor's Signature:

Residence: LANNION / FRANCE
(City, State)

Date: 11 JAN. 2002

Citizenship: HU
(Country)

Post Office Address: 12 Residence Coriay 22300 LANNION / FRANCE






Full Name of Third/Joint Inventor: [...]lerre
Inventor's Signature:

Date: [...] 2002

Citizenship: FR 

(Country) 



Post Office Address: 10, Cite Zant Erwan 22220 MINIHY TREGUIER / FRANCE 






